Image Credit: Lukas
With the emergence of the big data, new jobs have appeared demanding new sets of skills and expertise for extracting value from data (Axis Talent, 2020):
Which one is the most valued in the Mexican labor market currently?
To identify which data job category has the highest salary in the Mexican labor market in February 2023 according to the OCC website.
Which data job category has the highest salary in the Mexican labor market in February 2023 according to the OCC website?
The Data Scientist position has the highest salary in the Mexican labor market in February 2023 according to the OCC website.
The methodology of the present study is based on Rollin’s Foundational Methodology for Data Science (Rollins, 2015).
Furthermore:
In this context, the purpose of the present notebook is to perform the process of data exploration, data preparation, data analysis, data visualization as well as the statistical analysis.
# Loading Requirements Text File
#!pip install -r requirements.txt
# Libraries installation
!pip install -U kaleido
!pip install pip==21.2 # Pip version for successfully using the method get_installed_distributions
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaleido
Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.9/79.9 MB 9.5 MB/s eta 0:00:00
Installing collected packages: kaleido
Successfully installed kaleido-0.2.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip==21.2
Downloading pip-21.2-py3-none-any.whl (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 4.3 MB/s eta 0:00:00
WARNING: The candidate selected for download or install is a yanked version: 'pip' candidate (version 21.2 at https://files.pythonhosted.org/packages/03/0f/b125bfdd145c1d018d75ce87603e7e9ff2416e742c71b5ac7deba13ca699/pip-21.2-py3-none-any.whl#sha256=71f447dff669d8e2f72b880e3d7ddea2c85cfeba0d14f3307f66fc40ff755176 (from https://pypi.org/simple/pip/) (requires-python:>=3.6))
Reason for being yanked: See https://github.com/pypa/pip/issues/8711
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 22.0.4
Uninstalling pip-22.0.4:
Successfully uninstalled pip-22.0.4
Successfully installed pip-21.2
# Libraries importation
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats
import seaborn as sns
import folium
from folium.plugins import MarkerCluster
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import scipy.stats as stats
import statsmodels.stats.diagnostic as std
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.utils import resample
from pip._internal.utils.misc import get_installed_distributions
import sys
import os
# Setting theme and plot resolution
sns.set_theme(context = 'notebook', style = 'darkgrid')
mpl.rcParams["figure.dpi"] = 100
mpl.rcParams["savefig.dpi"] = 300
set_matplotlib_formats('svg')
# Setting default plot's aesthetics
plotfontcolor = 'dimgray'
mpl.rcParams['text.color'] = plotfontcolor
mpl.rcParams['axes.labelcolor'] = plotfontcolor
mpl.rcParams['xtick.color'] = plotfontcolor
mpl.rcParams['ytick.color'] = plotfontcolor
mpl.rcParams["font.size"] = 10
mpl.rcParams['axes.titlesize'] = 14
mpl.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 11
mpl.rcParams["axes.labelweight"] = "bold"
mpl.rcParams["axes.titleweight"] = "bold"
#mpl.rcParams['font.family'] = 'sans-serif'
#mpl.rcParams['font.family'] = 'serif'
# Configuration for exporting Plotly charts for publication
config = {
'toImageButtonOptions': {
'format': 'png', # one of png, svg, jpeg, webp
'filename': 'Plot',
'scale': 2 # Multiply title/legend/axis/canvas sizes by this factor
}
}
# Setting Font Properties for Plotly
font_px = dict(size=14, color ='dimgray')
# Setting Heatmap Color Scale for Plotly
heatmap_px_colorscale = [(0, "#000000"),
(0.03, px.colors.sequential.Blues[8]),
(0.25, px.colors.sequential.Blues[6]),
(0.5, px.colors.sequential.Blues[4]),
(0.75, px.colors.sequential.Blues[2]),
(1, "aliceblue")]
Firstly, the data collected from the web scraping process was loaded to a Pandas dataframe. Then, the dataset was explored with the Pandas and Seaborn libraries to gain some understanding and preliminary insights.
# Data importation
df = pd.read_csv('https://raw.githubusercontent.com/DanielEduardoLopez/DataJobsMX2023/main/Dataset.csv')
df.head(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | ANALISTA DE DATOS | Sueldo no mostrado por la empresa | NaN | Cuautitlán Izcalli, Edo. Méx. |
| 1 | ANALISTA DE DATOS | Sueldo no mostrado por la empresa | Hitss | CDMX |
| 2 | Analista de Datos | $20,000 - $25,000 Mensual | Aceros Levinson S.A. d... | Monterrey, N.L. |
| 3 | Analista de datos (por incapacidad) | $17,000 - $18,000 Mensual | Sky | Benito Juárez, CDMX |
| 4 | Analista de Datos y Costos | $15,000 - $18,000 Mensual | NaN | San Miguel de Allende, Gto. |
| 5 | Analista de Datos Bussines Intelligence | Sueldo no mostrado por la empresa | Comercializadora de Va... | Guadalajara, Jal. |
| 6 | Analista de datos maestros SAP | Sueldo no mostrado por la empresa | Productos Verde Valle,... | Zapopan, Jal. |
| 7 | Analista base de datos | $30,000 - $35,000 Mensual | Servicios TVM de Méxic... | Miguel Hidalgo, CDMX |
| 8 | Bases de Datos Analista | $15,000 Mensual | Manpower, S.A. de C.V. | San Luis Potosí, SLP. |
| 9 | Analista de Base de Datos | $15,000 - $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
It can be observed that data lacks consistency in every variable. Some observations are in uppercase, some values are abbreviations, there is no salary value for each tuple, some observations do not correspond to data jobs, etc. Thus, it is necessary to clean the data. Nonetheless, some further exploration will be carried out to develop a more complete strategy for doing so.
# Basic dataset description
df.describe()
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| count | 1292 | 1292 | 1061 | 1237 |
| unique | 899 | 249 | 439 | 181 |
| top | Analista de Tecnología de Datos para Contac Ce... | Sueldo no mostrado por la empresa | Manpower, S.A. de C.V. | CDMX |
| freq | 12 | 655 | 46 | 289 |
All of the variables are strings. This is expectable for the Job, Company, and Location variables; but no for Salary, which must be appropriately wrangled.
# Basic dataset info
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1292 entries, 0 to 1291 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job 1292 non-null object 1 Salary 1292 non-null object 2 Company 1061 non-null object 3 Location 1237 non-null object dtypes: object(4) memory usage: 40.5+ KB
From the results above it can be seen that the variables Job and Salary do not exhibit null values, while Company and Location do. Indeed, in some cases the organization offering the vacancy prefers to remain confidential.
# Value counts for each job position
df[['Job']].value_counts()
Job
Analista de Tecnología de Datos para Contac Center 12
Arquitecto de Datos 10
FCA (Fiat Chrysler) Plant Facilities Engineer Specialist 10
Analista en Prevención de Fraudes | Querétaro 9
Data engineer 9
..
Auxiliar de supervisión académica 1
Auxiliar de tráfico 1
Azure Data Engineer 1
BASE DE DATOS POSTGRETS 1
Customer Service - Order Entry Coordinator - EP 1
Length: 899, dtype: int64
932 different positions titles are present in the data set, while only 5 are required!
# Visual exploration of the Job variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Job'].value_counts().values[0:20], y= df['Job'].value_counts().index[0:20], palette="Blues_r", alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Job title')
plt.title('Most Common Data Jobs in Dataset')
plt.show()
It appears that Analista de Tecnología de Datos para Contact Center is the most demanded position, followed by Arquitecto de datos and FCA (Fiat Chrysler) Plant Facilities Engineer Specialist.
# Visual exploration of the Job variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.scatterplot(x = df['Job'].unique(), y= df['Job'].value_counts(), color="Green", alpha = 0.9)
plt.xticks([])
plt.xlabel('Job Title')
plt.ylabel('Frequency')
plt.title('Job Titles Frequency in Dataset')
plt.show()
The plot above shows that the largest part of the job titles retrieved from the web scraping are unique values. Thus, this suggests that the job titles variable is fairly inconsistent and/or it exhibits a lot of other positions which are not of interest for this study.
df['Location'].value_counts()
CDMX 289
Monterrey, N.L. 95
Guadalajara, Jal. 55
Miguel Hidalgo, CDMX 51
Querétaro, Qro. 33
...
San José Iturbide, Gto. 1
El Derramadero, Saltillo, Coah. 1
Tab. 1
Metepec, Edo. Méx. 1
Polanco V Sección, Miguel Hidalgo, CDMX 1
Name: Location, Length: 181, dtype: int64
CDMX is by far the location where most of the data jobs are demanded.
# Visual exploration of the Location variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Location'].value_counts().values[0:20], y= df['Location'].value_counts().index[0:20], palette="Blues_r", alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Location')
plt.title('Most Common Locations in Dataset')
plt.show()
Some states appear with its full name and some with its abbreviation. In Furthermore, in some cases, only the name of the state appears while in others the name of the municipality is also included.
Notwithstanding with the above, it is clear that CDMX (Mexico City) is the location where the most data jobs are demanded.
# Value counts for each company
df['Company'].value_counts()
Manpower, S.A. de C.V. 46
SIEMENS SA DE CV 28
Grupo Salinas 25
Atento Servicios, S. A... 24
Softtek 16
..
Universidad Tecnologic... 1
SERVICIOS FULTRA SA DE... 1
JLL ENGINEERING SERVICES 1
Bayer de México 1
Eaton Corporation 1
Name: Company, Length: 439, dtype: int64
# Visual exploration of the Company variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Company'].value_counts().values[0:20], y= df['Company'].value_counts().index[0:20], palette="Blues_r", alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Company')
plt.title('Most Common Companies in Dataset')
plt.show()
Manpower is the company with the highest number of data jobs vacancies. It is followed by Siemens, Grupo Salinas and Atento Servicios.
# Visual exploration of the Company variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.scatterplot(x = df['Company'].value_counts().index, y= df['Company'].value_counts().values, color="Salmon", alpha = 0.9)
plt.xticks([])
plt.xlabel('Company')
plt.ylabel('Frequency')
plt.title('Companies Frequency in Dataset')
plt.show()
The plot above shows that most companies only offer one or two data-related positions.
# Value counts for each Salary
df['Salary'].value_counts()
Sueldo no mostrado por la empresa 655
$15,000 Mensual 30
$20,000 - $25,000 Mensual 24
$7,600 - $9,000 Mensual 12
$15,000 - $18,000 Mensual 12
...
$35,000 - $36,000 Mensual 1
$80,000 - $95,000 Mensual 1
$63,000 - $64,000 Mensual 1
$80,000 - $85,000 Mensual 1
$15,000 - $15,500 Mensual 1
Name: Salary, Length: 249, dtype: int64
# Visual exploration of the Company variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Salary'].value_counts().values[0:20], y= df['Salary'].value_counts().index[0:20], palette="Blues_r", alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Salary')
plt.title('Salaries in Dataset')
plt.show()
Obviously, is necessary to clean the data in order to get more meaningful insights regarding the salary. However, it is noteworthy that most of the published vacancies have an undisclosed salary.
For the vacancies that have a disclosed salary, a salary range is most commonly published rather than a single value.
Of course, the periodicity of the payment ("Mensual": monthly) must be removed from all observations.
On the other hand, it "appears" that a salary of about 15,000 MXN is the most common, which is probably not accurate due to the fact that the salary variable is a string instead of an actual number.
After the data was explored, some wrangling strategies to effectively clean the data were defined. In this sense, some of the procedures applied to the data were:
From the raw data obtained through web scraping, it has been noticed that much of the vacancies published lack of true salary data. However, those positions without said data might be useful for assessing the geographical distribution of the data jobs in Mexico as well as identiying which companies demand more data specialists even if they have not published a proposed salary.
Thus, two datasets were prepared:
# Drop of duplicates
df = df.drop_duplicates()
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1021 entries, 0 to 1290 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job 1021 non-null object 1 Salary 1021 non-null object 2 Company 850 non-null object 3 Location 974 non-null object dtypes: object(4) memory usage: 39.9+ KB
This section shows the process carried out to harmonize the job titles of the data-related positions and the drop of non data-related jobs.
To do so, a data jobs dictionary using keys in both English and Spanish was defined. If a certain vacancy comprised all the terms in the keys, then, the job title was renamed to a standard job title.
# Creating job dictionary
job_dict = {
('data', 'analyst'): "Data Analyst",
('analista', 'datos'): "Data Analyst",
('data', 'scientist'): "Data Scientist",
('científico', 'datos'): "Data Scientist",
('cientifico', 'datos'): "Data Scientist",
('data', 'engineer'): "Data Engineer",
('ingeniero', 'datos'): "Data Engineer",
('business', 'analyst'): "Business Analyst",
('analista', 'negocio'): "Business Analyst",
('data', 'architect'): "Data Architect",
('arquitecto', 'datos'): "Data Architect"
}
# Proof of concept
list(job_dict.keys())[0][0] in 'data analyst' and list(job_dict.keys())[0][1] in 'data analyst'
True
# Conversion of job observations to lower case
df['Job'] = df['Job'].str.casefold()
# Cleaning of the variable Job
for i in range(len(job_dict)):
df['Job'] = df['Job'].map(lambda x: list(job_dict.values())[i] if (list(job_dict.keys())[i][0] in x and list(job_dict.keys())[i][1] in x) else x)
df.head(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Data Analyst | Sueldo no mostrado por la empresa | NaN | Cuautitlán Izcalli, Edo. Méx. |
| 1 | Data Analyst | Sueldo no mostrado por la empresa | Hitss | CDMX |
| 2 | Data Analyst | $20,000 - $25,000 Mensual | Aceros Levinson S.A. d... | Monterrey, N.L. |
| 3 | Data Analyst | $17,000 - $18,000 Mensual | Sky | Benito Juárez, CDMX |
| 4 | Data Analyst | $15,000 - $18,000 Mensual | NaN | San Miguel de Allende, Gto. |
| 5 | Data Analyst | Sueldo no mostrado por la empresa | Comercializadora de Va... | Guadalajara, Jal. |
| 6 | Data Analyst | Sueldo no mostrado por la empresa | Productos Verde Valle,... | Zapopan, Jal. |
| 7 | Data Analyst | $30,000 - $35,000 Mensual | Servicios TVM de Méxic... | Miguel Hidalgo, CDMX |
| 8 | Data Analyst | $15,000 Mensual | Manpower, S.A. de C.V. | San Luis Potosí, SLP. |
| 9 | Data Analyst | $15,000 - $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
# Visual exploration of the pre-cleaned Job variable (first 15 positions)
fig, ax = plt.subplots(figsize = (10, 6))
sns.barplot(x = df['Job'].value_counts()[:15], y= df['Job'].value_counts().keys()[:15], palette="Blues_r", alpha = 0.9)
plt.title('First 15 Job Titles by Frequency')
plt.xlabel('Frequency')
plt.ylabel('Job Title')
plt.show()
In the plot above it is possible to observe:
# Drop of observations not corresponding to Data Jobs
df = df.loc[(df['Job'] == 'Data Analyst') | (df['Job'] == 'Business Analyst') | (df['Job'] == 'Data Engineer') | (df['Job'] == 'Data Scientist') | (df['Job'] == 'Data Architect')]
df['Job'].value_counts()
Data Analyst 91 Data Scientist 52 Data Engineer 48 Data Architect 39 Business Analyst 28 Name: Job, dtype: int64
# Resulting number of observations
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 258 entries, 0 to 1220 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job 258 non-null object 1 Salary 258 non-null object 2 Company 217 non-null object 3 Location 239 non-null object dtypes: object(4) memory usage: 10.1+ KB
# Visual exploration of the cleaned Job variable (first 15 positions)
fig, ax = plt.subplots(figsize = (8, 4))
sns.barplot(x = df['Job'].value_counts()[:15], y= df['Job'].value_counts().keys()[:15], palette="Blues_r", alpha = 0.7)
plt.title('Frequency per Job Title')
plt.xlabel('Frequency')
plt.ylabel('Job Title')
plt.show()
From the plot above, it is possible to see that the variable Job has been successfully cleaned as data-related jobs are the only observations in the data set and their job titles are consistent. It is also noteworthy that the Data Analyst is the most demanded position in February 2023.
This section shows the process carried out to replace the null values for a string legend of "Not Disclosed" and to provide consistency to the letter case in the company names of the retrieved data.
# Cleaning of the Company variable
df['Company'] = df['Company'].apply(lambda x: "Not Disclosed" if pd.isnull(x) else x)
df['Company'] = df['Company'].apply(lambda x: str.title(str(x)))
df.head(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Data Analyst | Sueldo no mostrado por la empresa | Not Disclosed | Cuautitlán Izcalli, Edo. Méx. |
| 1 | Data Analyst | Sueldo no mostrado por la empresa | Hitss | CDMX |
| 2 | Data Analyst | $20,000 - $25,000 Mensual | Aceros Levinson S.A. D... | Monterrey, N.L. |
| 3 | Data Analyst | $17,000 - $18,000 Mensual | Sky | Benito Juárez, CDMX |
| 4 | Data Analyst | $15,000 - $18,000 Mensual | Not Disclosed | San Miguel de Allende, Gto. |
| 5 | Data Analyst | Sueldo no mostrado por la empresa | Comercializadora De Va... | Guadalajara, Jal. |
| 6 | Data Analyst | Sueldo no mostrado por la empresa | Productos Verde Valle,... | Zapopan, Jal. |
| 7 | Data Analyst | $30,000 - $35,000 Mensual | Servicios Tvm De Méxic... | Miguel Hidalgo, CDMX |
| 8 | Data Analyst | $15,000 Mensual | Manpower, S.A. De C.V. | San Luis Potosí, SLP. |
| 9 | Data Analyst | $15,000 - $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
# Visual exploration of the cleaned Company variable
fig, ax = plt.subplots(figsize = (8, 6))
sns.barplot(x = df['Company'].value_counts()[:15], y= df['Company'].value_counts().keys()[:15], palette="Blues_r", alpha = 0.8)
plt.title('Top 15 Companies Offering Data Jobs')
plt.xlabel('Frequency')
plt.ylabel('Company')
plt.show()
The plot above suggests that the letter case of the different company names is now consistent. So, it is deemed that the variable Company has been successfully cleaned.
This section shows the process carried out to harmonize the name of the Mexican States in which the data-related positions are demanded.
To do so, firstly, the state component was extracted from the location data.
# Extraction of the state from the location data
df['Location'] = df['Location'].apply(lambda x: "Not Disclosed" if pd.isnull(x) else x)
df['Location'] = df['Location'].apply(lambda x: str(x).split(",")[-1] if len(str(x).split(",")) > 1 else x)
df['Location'] = df['Location'].apply(lambda x: str(x.replace('\u200b', ' ')).strip())
df.head(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Data Analyst | Sueldo no mostrado por la empresa | Not Disclosed | Edo. Méx. |
| 1 | Data Analyst | Sueldo no mostrado por la empresa | Hitss | CDMX |
| 2 | Data Analyst | $20,000 - $25,000 Mensual | Aceros Levinson S.A. D... | N.L. |
| 3 | Data Analyst | $17,000 - $18,000 Mensual | Sky | CDMX |
| 4 | Data Analyst | $15,000 - $18,000 Mensual | Not Disclosed | Gto. |
| 5 | Data Analyst | Sueldo no mostrado por la empresa | Comercializadora De Va... | Jal. |
| 6 | Data Analyst | Sueldo no mostrado por la empresa | Productos Verde Valle,... | Jal. |
| 7 | Data Analyst | $30,000 - $35,000 Mensual | Servicios Tvm De Méxic... | CDMX |
| 8 | Data Analyst | $15,000 Mensual | Manpower, S.A. De C.V. | SLP. |
| 9 | Data Analyst | $15,000 - $16,000 Mensual | Grupo Daisa | CDMX |
Then, a dictionary was defined using the abbreviations in the keys and the full state names in the values. After that, the dictionary was passed to the data in order to replace the abbreviations by the full state names.
# Dictionary with the retrieved Location values
location_dict = {
'CDMX': 'Ciudad de México',
'Chih.': 'Chihuahua',
'Edo. Méx.': 'Estado de México',
'Gto.': 'Guanajuato',
'Jal.': 'Jalisco',
'Q. Roo': 'Quintana Roo',
'N. L.': 'Nuevo León',
'N.L.': 'Nuevo León',
'Pue.': 'Puebla',
'México': 'Estado de México',
'Zac.': 'Zacatecas',
'Tamps.': 'Tamaulipas',
'Mor.': 'Morelos',
'Sin.': 'Sinaloa',
'Oax.': 'Oaxaca',
'Qro.': 'Querétaro',
'Mich.': 'Michoacán',
'Son.': 'Sonora',
'BC.': 'Baja California',
'SLP.': 'San Luis Potosí',
'Yuc.': 'Yucatán',
'Coah.': 'Coahuila',
'BCS.': 'Baja California Sur',
'Nay.': 'Nayarit',
'Ags.': 'Aguascalientes',
'Hgo.': 'Hidalgo',
'Chis.': 'Chiapas',
}
# Cleaning of the Location variable in the dataset
for i in range(len(location_dict)):
df['Location'] = df['Location'].apply(lambda x: list(location_dict.values())[i] if x == list(location_dict.keys())[i] else x)
df.head(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Data Analyst | Sueldo no mostrado por la empresa | Not Disclosed | Estado de México |
| 1 | Data Analyst | Sueldo no mostrado por la empresa | Hitss | Ciudad de México |
| 2 | Data Analyst | $20,000 - $25,000 Mensual | Aceros Levinson S.A. D... | Nuevo León |
| 3 | Data Analyst | $17,000 - $18,000 Mensual | Sky | Ciudad de México |
| 4 | Data Analyst | $15,000 - $18,000 Mensual | Not Disclosed | Guanajuato |
| 5 | Data Analyst | Sueldo no mostrado por la empresa | Comercializadora De Va... | Jalisco |
| 6 | Data Analyst | Sueldo no mostrado por la empresa | Productos Verde Valle,... | Jalisco |
| 7 | Data Analyst | $30,000 - $35,000 Mensual | Servicios Tvm De Méxic... | Ciudad de México |
| 8 | Data Analyst | $15,000 Mensual | Manpower, S.A. De C.V. | San Luis Potosí |
| 9 | Data Analyst | $15,000 - $16,000 Mensual | Grupo Daisa | Ciudad de México |
# Visual exploration of the cleaned Location variable
fig, ax = plt.subplots(figsize = (5, 8))
sns.barplot(x = df['Location'].value_counts().values, y= df['Location'].value_counts().index, palette="Spectral_r", alpha = 0.9)
plt.title('Frequency per Mexican State')
plt.xlabel('Frequency')
plt.ylabel('State')
plt.show()
The plot above shows that the names of the Mexican states where the data job vacancies are demanded is now consistent. So, the variable Location has been successfully cleaned.
This section shows the process carried out to wrangle and clean the salary data. To do so, the salary observations were cleaned from useless characters, then, the salary ranges were splitted into two columns and a average salary was calculated when applicable. Finally, non useful Salary columns were dropped.
# Checking if all the Salary observations are disclosed in a monthly basis
df[df['Salary'].apply(lambda x: True if ('Mensual' not in str(x)) and (x is not np.nan) else False) & (df['Salary'] != 'Sueldo no mostrado por la empresa')]
| Job | Salary | Company | Location |
|---|
Thus, all of the observations correspond to monthly salaries.
# Replacing the wording of the positions without disclosed salaries with the legend "Not Disclosed"
df['Salary'] = df['Salary'].apply(lambda x: 'Not Disclosed' if x == 'Sueldo no mostrado por la empresa' else x)
df.head()
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Data Analyst | Not Disclosed | Not Disclosed | Estado de México |
| 1 | Data Analyst | Not Disclosed | Hitss | Ciudad de México |
| 2 | Data Analyst | $20,000 - $25,000 Mensual | Aceros Levinson S.A. D... | Nuevo León |
| 3 | Data Analyst | $17,000 - $18,000 Mensual | Sky | Ciudad de México |
| 4 | Data Analyst | $15,000 - $18,000 Mensual | Not Disclosed | Guanajuato |
# Removing useless characters from column values
char_remove = {
'Anual': '',
'Mensual': '',
'$': '',
',': ''}
for key, value in char_remove.items():
df['Salary'] = df['Salary'].str.replace(key, value).str.strip()
df.head()
<ipython-input-36-efe8f30cb4d3>:9: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True. df['Salary'] = df['Salary'].str.replace(key, value).str.strip()
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Data Analyst | Not Disclosed | Not Disclosed | Estado de México |
| 1 | Data Analyst | Not Disclosed | Hitss | Ciudad de México |
| 2 | Data Analyst | 20000 - 25000 | Aceros Levinson S.A. D... | Nuevo León |
| 3 | Data Analyst | 17000 - 18000 | Sky | Ciudad de México |
| 4 | Data Analyst | 15000 - 18000 | Not Disclosed | Guanajuato |
# Splitting Salary column values into two columns
df['Salary'] = df['Salary'].apply(lambda x: np.nan if x == "Not Disclosed" else x)
df[['Min Salary', 'Max Salary']] = df['Salary'].str.split('-', 1, expand=True)
df['Min Salary'] = df['Min Salary'].str.strip().astype(np.float64)
df['Max Salary'] = df['Max Salary'].str.strip().astype(np.float64)
df.head()
| Job | Salary | Company | Location | Min Salary | Max Salary | |
|---|---|---|---|---|---|---|
| 0 | Data Analyst | NaN | Not Disclosed | Estado de México | NaN | NaN |
| 1 | Data Analyst | NaN | Hitss | Ciudad de México | NaN | NaN |
| 2 | Data Analyst | 20000 - 25000 | Aceros Levinson S.A. D... | Nuevo León | 20000.0 | 25000.0 |
| 3 | Data Analyst | 17000 - 18000 | Sky | Ciudad de México | 17000.0 | 18000.0 |
| 4 | Data Analyst | 15000 - 18000 | Not Disclosed | Guanajuato | 15000.0 | 18000.0 |
# Calculating average salary for each vacancy
df['Avg Salary'] = (df['Min Salary'] + df['Max Salary']) / 2
df.head()
| Job | Salary | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|
| 0 | Data Analyst | NaN | Not Disclosed | Estado de México | NaN | NaN | NaN |
| 1 | Data Analyst | NaN | Hitss | Ciudad de México | NaN | NaN | NaN |
| 2 | Data Analyst | 20000 - 25000 | Aceros Levinson S.A. D... | Nuevo León | 20000.0 | 25000.0 | 22500.0 |
| 3 | Data Analyst | 17000 - 18000 | Sky | Ciudad de México | 17000.0 | 18000.0 | 17500.0 |
| 4 | Data Analyst | 15000 - 18000 | Not Disclosed | Guanajuato | 15000.0 | 18000.0 | 16500.0 |
# Visual exploration of the cleaned Salary variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.histplot(data= df['Avg Salary'], color = sns.color_palette('Blues_r')[0], alpha = 0.6)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()
The histogram above suggests that all Salary observations have been transformed from a range in a string format into actual numbers. So, it is considered that the variable Salary has been successfully cleaned. On the other hand, it is noteworthy that some observations fall well beyond the main salary distribution. Let's take a look into those observations.
# Exploring positions whose salary is more than $80,000 MXN per month
df[df['Avg Salary'] > 80000]
| Job | Salary | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|
| 701 | Data Engineer | 95000 - 105000 | Not Disclosed | Ciudad de México | 95000.0 | 105000.0 | 100000.0 |
| 794 | Data Architect | 95000 - 105000 | Especialistas En Talen... | Ciudad de México | 95000.0 | 105000.0 | 100000.0 |
| 805 | Data Architect | 135000 - 140000 | Manpower, S.A. De C.V. | Not Disclosed | 135000.0 | 140000.0 | 137500.0 |
| 807 | Data Architect | 130000 - 145000 | Manpower, S.A. De C.V. | Ciudad de México | 130000.0 | 145000.0 | 137500.0 |
From the above results, it is plausible that Data Engineer and Data Architect positions earns an extremely high salaries as the ones showed as such positions require advanced programming skills and expertise.
Finally, the original Salary column was dropped.
# Dropping of the original Salary column
df = df.drop(columns = ['Salary'])
df.head()
| Job | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|
| 0 | Data Analyst | Not Disclosed | Estado de México | NaN | NaN | NaN |
| 1 | Data Analyst | Hitss | Ciudad de México | NaN | NaN | NaN |
| 2 | Data Analyst | Aceros Levinson S.A. D... | Nuevo León | 20000.0 | 25000.0 | 22500.0 |
| 3 | Data Analyst | Sky | Ciudad de México | 17000.0 | 18000.0 | 17500.0 |
| 4 | Data Analyst | Not Disclosed | Guanajuato | 15000.0 | 18000.0 | 16500.0 |
df.describe()
| Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|
| count | 105.000000 | 92.000000 | 92.000000 |
| mean | 33439.238095 | 39887.119565 | 37240.951087 |
| std | 24991.531477 | 28349.512533 | 27027.431744 |
| min | 7000.000000 | 9000.000000 | 8300.000000 |
| 25% | 15000.000000 | 19500.000000 | 17500.000000 |
| 50% | 28000.000000 | 31000.000000 | 30000.000000 |
| 75% | 43000.000000 | 52750.000000 | 50375.000000 |
| max | 135000.000000 | 145000.000000 | 137500.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 258 entries, 0 to 1220 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job 258 non-null object 1 Company 258 non-null object 2 Location 258 non-null object 3 Min Salary 105 non-null float64 4 Max Salary 92 non-null float64 5 Avg Salary 92 non-null float64 dtypes: float64(3), object(3) memory usage: 14.1+ KB
It seems that some salary data is only in the Min Salary column (105 non-null vs. 92 non-null). Those observations correspond to the vacancies in which the published salary was not a range but a single value. Thus, said value was copied into the Max Salary and Avg Salary columns.
# Copying values from Min Salary column to the Max Salary and Avg Salary columns
for index, value in df['Min Salary'].iteritems():
if pd.isnull(df['Max Salary'][index]) and pd.isnull(df['Avg Salary'][index]):
df['Max Salary'][index] = value
df['Avg Salary'][index] = value
df.head()
<ipython-input-44-1a55d367c8db>:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['Max Salary'][index] = value <ipython-input-44-1a55d367c8db>:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['Avg Salary'][index] = value
| Job | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|
| 0 | Data Analyst | Not Disclosed | Estado de México | NaN | NaN | NaN |
| 1 | Data Analyst | Hitss | Ciudad de México | NaN | NaN | NaN |
| 2 | Data Analyst | Aceros Levinson S.A. D... | Nuevo León | 20000.0 | 25000.0 | 22500.0 |
| 3 | Data Analyst | Sky | Ciudad de México | 17000.0 | 18000.0 | 17500.0 |
| 4 | Data Analyst | Not Disclosed | Guanajuato | 15000.0 | 18000.0 | 16500.0 |
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 258 entries, 0 to 1220 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job 258 non-null object 1 Company 258 non-null object 2 Location 258 non-null object 3 Min Salary 105 non-null float64 4 Max Salary 105 non-null float64 5 Avg Salary 105 non-null float64 dtypes: float64(3), object(3) memory usage: 22.2+ KB
So, the dataset has been successfully cleaned and prepared for the Data Analysis & Visualization.
This section shows the process for exporting the desired dataset.
# Exporting of the cleaned dataset
df.to_csv('Dataset_Clean.csv', index=False, encoding='utf-8')
The second dataframe with only those vacancies whose salary have been disclosed was built in the present section.
# Second dataframe with only disclosed salary data
salary_df = df.dropna(axis = 0, how='any', subset = ['Avg Salary'])
salary_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 105 entries, 2 to 1219 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job 105 non-null object 1 Company 105 non-null object 2 Location 105 non-null object 3 Min Salary 105 non-null float64 4 Max Salary 105 non-null float64 5 Avg Salary 105 non-null float64 dtypes: float64(3), object(3) memory usage: 5.7+ KB
salary_df.describe()
| Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|
| count | 105.000000 | 105.000000 | 105.000000 |
| mean | 33439.238095 | 38076.333333 | 35757.785714 |
| std | 24991.531477 | 27554.598385 | 26219.917604 |
| min | 7000.000000 | 7000.000000 | 7000.000000 |
| 25% | 15000.000000 | 17000.000000 | 16500.000000 |
| 50% | 28000.000000 | 30000.000000 | 28000.000000 |
| 75% | 43000.000000 | 50000.000000 | 46000.000000 |
| max | 135000.000000 | 145000.000000 | 137500.000000 |
After the dataset was cleaned and prepared, the data was analyzed in order to answer the following questions:
To answer these questions, different plots and charts were drawn using both Matplotlib and Plotly, selecting for reporting the most aesthetic one.
On the other hand, a specific directory named "Figures" was used to store the relevant plots created during the analysis.
if not os.path.exists("Figures"):
os.mkdir("Figures")
To answer this question, a donut chart was plotted.
# Donut chart of the data jobs demand with Matplotlib
fig, ax = plt.subplots(figsize = (8, 13))
colors = sns.color_palette('Blues_r')
explode = [0.02] * 5
wedges, texts, autotexts = ax.pie(x = list(df['Job'].value_counts().values), wedgeprops=dict(edgecolor='w', linewidth= 1),
textprops = dict(size=17, weight="bold", color = 'black'), colors = ['navy',colors[2],colors[3],colors[4],colors[5]],
autopct='%.0f%%', pctdistance=1.15, startangle = 90, counterclock = False, explode = explode)
center = plt.Circle( (0,0), 0.7, color='white')
p = plt.gcf()
p.gca().add_artist(center)
plt.legend(wedges, list(df['Job'].value_counts().keys()), fontsize=13, loc="center right", bbox_to_anchor=(1.2, 0, 0.3, 1))
plt.tight_layout()
plt.savefig('Figures/Fig1_DemandOfDataJobsPerCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Donut chart of the data jobs demand with Plotly
job_df = pd.DataFrame(df['Job'].value_counts().reset_index().rename(columns = {'index': 'Job', 'Job': 'Count'}))
pie_colors = ['#06477D','#84BDEC','#B4D4EF', '#C8E4FC','aliceblue']
fig = px.pie(job_df, values='Count', names='Job', color = 'Job', hole = 0.7,
color_discrete_sequence=px.colors.sequential.Blues_r,
height= 500,
width = 700,
title='Demand of Data Jobs Per Category')
fig.update_layout(title_x=0.5, font=font_px)
fig.update_traces(hoverinfo='label+percent+name', textinfo='percent', textfont_size=18,
marker=dict(colors=pie_colors, line=dict(color='white', width=4)))
fig.write_image("Figures/Fig1_DemandOfDataJobsPerCategoryPX.png", scale=2)
fig.write_image("Figures/Fig1_DemandOfDataJobsPerCategoryPX.svg", scale=2)
fig.show(config=config)
From the above plot, it is possible to conclude that more than a third of the data jobs (35%) correspond to positions of Data Analyst, rendering them as the most demanded ones in the Mexican labor market at the time of this study (February 2023). In second place, it is possible to found the positions of Data Scientist with about 20% of the total.
Furthermore, it is noteworthy that Data Engineer positions are more demanded than Data Architect positions with about 19% and about 15%, respectively.
On the contrary, Business Analyst positions are the less demanded, with only about 11% out of the total.
To answer this question, a choropleth was prepared using Folium and Plotly.
First, to define the specific location of the markers in the Folium map, a dictionary was defined with the corresponding ID, Latitude and Longitude of the capital cities for each Mexican State.
The IDs correspond to the IDs of the file with the geographical data.
Then, the dictionary was converted into a Pandas dataframe.
# Mexican states dictionary with corresponding ID, Latitude and Longitude
states_dict = {'Aguascalientes': ('AS', 21.87945992, -102.2904135),
'Baja California': ('BC', 32.663214,-115.4903741),
'Baja California Sur': ('BS', 24.1584937,-110.315928),
'Campeche': ('CC', 19.8450352,-90.5381231),
'Chiapas': ('CS', 16.7541485,-93.119001),
'Chihuahua': ('CH', 28.6349557,-106.0777049),
'Coahuila': ('CL', 25.4286965,-100.9994484),
'Colima': ('CM', 19.2408324,-103.7291389),
'Ciudad de México': ('DF', 19.4335493,-99.1344048),
'Durango': ('DG', 24.0241017,-104.6708325),
'Guanajuato': ('GT', 21.0176446,-101.2586863),
'Guerrero': ('GR', 17.5516921,-99.5025877),
'Hidalgo': ('HG', 20.1183855,-98.7540094),
'Jalisco': ('JC', 20.6773775,-103.3494204),
'Estado de México': ('MC', 19.289191,-99.6670425),
'Michoacán': ('MN', 19.7030535,-101.1937953),
'Morelos': ('MS', 18.9218499,-99.2353856),
'Nayarit': ('NT', 21.5122308,-104.8948845),
'Nuevo León': ('NL', 25.6717637,-100.3163831),
'Oaxaca': ('OC', 17.0617935,-96.7271634),
'Puebla': ('PL', 19.0428817,-98.2002919),
'Querétaro': ('QT', 20.37998212, -100.0000308),
'Quintana Roo': ('QR', 18.4978052,-88.3029951),
'San Luis Potosí': ('SP', 22.1521646,-100.9765552),
'Sinaloa': ('SL', 24.8082702,-107.3945828),
'Sonora': ('SR', 29.0748734,-110.9597578),
'Tabasco': ('TC', 17.9882632,-92.9209807),
'Tamaulipas': ('TS', 23.7312703,-99.1517694),
'Tlaxcala': ('TL', 19.3171271,-98.2386354),
'Veracruz': ('VZ', 19.5269375,-96.92401),
'Yucatán': ('YN', 20.9664386,-89.623114),
'Zacatecas': ('ZS', 22.7753476,-102.5740002)}
states_df = pd.DataFrame.from_dict(states_dict, orient='index').reset_index().\
rename(columns={"index": "State", 0: "ID", 1: "Lat", 2: "Long"}).set_index('State')
states_df.head()
| ID | Lat | Long | |
|---|---|---|---|
| State | |||
| Aguascalientes | AS | 21.879460 | -102.290413 |
| Baja California | BC | 32.663214 | -115.490374 |
| Baja California Sur | BS | 24.158494 | -110.315928 |
| Campeche | CC | 19.845035 | -90.538123 |
| Chiapas | CS | 16.754148 | -93.119001 |
After that, with the purpose of providing a color to each Mexican state in the choropleth, the percentage of data jobs for each state was calculated using a value counts. For those states without data jobs, a join was performed and the NaN values were filled with zeros.
# Calculating the Demand Percentage per State
demand_by_state_df = pd.DataFrame(df['Location'].value_counts())
total = sum(demand_by_state_df['Location'])
demand_by_state_df['Percentage'] = (demand_by_state_df['Location']) / total *100
demand_by_state_df = demand_by_state_df.reset_index().rename(columns={"index": "State", "Location": "Count"})
demand_by_state_df = states_df.merge(demand_by_state_df, left_on='State', right_on='State', how = 'outer').fillna(0)
demand_by_state_df
| State | ID | Lat | Long | Count | Percentage | |
|---|---|---|---|---|---|---|
| 0 | Aguascalientes | AS | 21.879460 | -102.290413 | 1.0 | 0.387597 |
| 1 | Baja California | BC | 32.663214 | -115.490374 | 2.0 | 0.775194 |
| 2 | Baja California Sur | BS | 24.158494 | -110.315928 | 0.0 | 0.000000 |
| 3 | Campeche | CC | 19.845035 | -90.538123 | 0.0 | 0.000000 |
| 4 | Chiapas | CS | 16.754148 | -93.119001 | 1.0 | 0.387597 |
| 5 | Chihuahua | CH | 28.634956 | -106.077705 | 2.0 | 0.775194 |
| 6 | Coahuila | CL | 25.428697 | -100.999448 | 0.0 | 0.000000 |
| 7 | Colima | CM | 19.240832 | -103.729139 | 0.0 | 0.000000 |
| 8 | Ciudad de México | DF | 19.433549 | -99.134405 | 145.0 | 56.201550 |
| 9 | Durango | DG | 24.024102 | -104.670833 | 0.0 | 0.000000 |
| 10 | Guanajuato | GT | 21.017645 | -101.258686 | 4.0 | 1.550388 |
| 11 | Guerrero | GR | 17.551692 | -99.502588 | 0.0 | 0.000000 |
| 12 | Hidalgo | HG | 20.118385 | -98.754009 | 2.0 | 0.775194 |
| 13 | Jalisco | JC | 20.677377 | -103.349420 | 23.0 | 8.914729 |
| 14 | Estado de México | MC | 19.289191 | -99.667042 | 14.0 | 5.426357 |
| 15 | Michoacán | MN | 19.703053 | -101.193795 | 0.0 | 0.000000 |
| 16 | Morelos | MS | 18.921850 | -99.235386 | 0.0 | 0.000000 |
| 17 | Nayarit | NT | 21.512231 | -104.894885 | 0.0 | 0.000000 |
| 18 | Nuevo León | NL | 25.671764 | -100.316383 | 30.0 | 11.627907 |
| 19 | Oaxaca | OC | 17.061794 | -96.727163 | 2.0 | 0.775194 |
| 20 | Puebla | PL | 19.042882 | -98.200292 | 1.0 | 0.387597 |
| 21 | Querétaro | QT | 20.379982 | -100.000031 | 4.0 | 1.550388 |
| 22 | Quintana Roo | QR | 18.497805 | -88.302995 | 1.0 | 0.387597 |
| 23 | San Luis Potosí | SP | 22.152165 | -100.976555 | 1.0 | 0.387597 |
| 24 | Sinaloa | SL | 24.808270 | -107.394583 | 3.0 | 1.162791 |
| 25 | Sonora | SR | 29.074873 | -110.959758 | 0.0 | 0.000000 |
| 26 | Tabasco | TC | 17.988263 | -92.920981 | 0.0 | 0.000000 |
| 27 | Tamaulipas | TS | 23.731270 | -99.151769 | 1.0 | 0.387597 |
| 28 | Tlaxcala | TL | 19.317127 | -98.238635 | 0.0 | 0.000000 |
| 29 | Veracruz | VZ | 19.526937 | -96.924010 | 0.0 | 0.000000 |
| 30 | Yucatán | YN | 20.966439 | -89.623114 | 2.0 | 0.775194 |
| 31 | Zacatecas | ZS | 22.775348 | -102.574000 | 0.0 | 0.000000 |
| 32 | Not Disclosed | 0 | 0.000000 | 0.000000 | 19.0 | 7.364341 |
For the Folium map, a dictionary was defined to assign a specific color for each type of data job.
# Dictionary to map job colors
job_colors = {
'Data Analyst': 'blue',
'Business Analyst': 'green',
'Data Engineer': 'black',
'Data Scientist': 'red',
'Data Architect': 'yellow'
}
Later, each vacancy was colored using the dictionary above to map job color.
# Join of the job data dataframe with the Mexican States latitude and longitud
jobs_loc = df.merge(states_df, left_on='Location', right_on='State', how = 'inner')
jobs_loc['Marker Color'] = jobs_loc['Job'].map(job_colors)
jobs_loc.head()
| Job | Company | Location | Min Salary | Max Salary | Avg Salary | ID | Lat | Long | Marker Color | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Analyst | Not Disclosed | Estado de México | NaN | NaN | NaN | MC | 19.289191 | -99.667042 | blue |
| 1 | Data Analyst | Grupo Kasa | Estado de México | NaN | NaN | NaN | MC | 19.289191 | -99.667042 | blue |
| 2 | Data Analyst | Atento Servicios, S. A... | Estado de México | 7640.0 | 9900.0 | 8770.0 | MC | 19.289191 | -99.667042 | blue |
| 3 | Data Analyst | Atento Servicios, S. A... | Estado de México | 7600.0 | 9000.0 | 8300.0 | MC | 19.289191 | -99.667042 | blue |
| 4 | Data Analyst | Atento Servicios, S. A... | Estado de México | 7600.0 | 9000.0 | 8300.0 | MC | 19.289191 | -99.667042 | blue |
Then, the choropleth using Folium was created, and the vacancy markers were added.
The JSON file with the geographical data was taken from Arroyo-Velázquez (2022).
# Initialization of Folium map
m = folium.Map(location=[24,-105], zoom_start=5)
# Setting of the Choropleth
folium.Choropleth(
geo_data = 'https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json',
name = "Data Jobs Localization",
data = demand_by_state_df,
columns = ["ID", "Percentage"],
key_on = "feature.id",
fill_color = "Blues",
fill_opacity = 0.7,
line_opacity = 0.1,
legend_name = "Data Jobs Demand (%)",
).add_to(m)
marker_cluster = MarkerCluster()
m.add_child(marker_cluster)
# Adding of individual markers to clusters
for index, record in jobs_loc.iterrows():
marker = folium.Marker(list(record[7:9]),
icon = folium.Icon(color='white', icon_color=str(record[-1])))
marker_cluster.add_child(marker)
# Setting of the Layer Control
folium.LayerControl().add_to(m)
# Showing the map
m
Finally, a choropleth map with Plotly was drawn.
# Choropleth map with Plotly
fig = px.choropleth(demand_by_state_df,
geojson = 'https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json',
locations='ID',
color='Percentage',
color_continuous_scale="Blues",
scope="north america",
#title='Demand of Data Jobs per Mexican State',
labels={'Percentage':'National Demand %'},
height= 500,
width = 800
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0}, title_x=0.5, font=font_px)
fig.update_geos(fitbounds="locations", visible=False)
fig.write_image("Figures/Fig2_DemandOfDataJobsPerMexicanStatePX.png", scale=2)
fig.write_image("Figures/Fig2_DemandOfDataJobsPerMexicanStatePX.svg", scale=2)
fig.show(config=config)
The above map strongly suggests that most of the data jobs are concentrated in the capital Mexico City with more than 50% of the demand at the moment of this study (February 2023).
On the other hand, according to the data, Nuevo León, Jalisco, and State of Mexico ("Estado de México" in Spanish) represent distant second places, with only about 10% of the total demand.
Reference for JSON file with the Geo Data:
To answer this question, a bar plot and a heatmap were drawn.
First, a dataframe was prepared using the pivot_table function from Pandas using the location as the index.
The vacancies without specific locations disclosed were removed.
# Preparing dataframe of jobs per location
jobs_per_location_df = (df.pivot_table(index = 'Location', columns = 'Job', values = 'Company', aggfunc = 'count')
.fillna(0).drop("Not Disclosed"))
jobs_per_location_df['Total'] = jobs_per_location_df.sum(axis=1, numeric_only= True)
jobs_per_location_df.head()
| Job | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | Total |
|---|---|---|---|---|---|---|
| Location | ||||||
| Aguascalientes | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| Baja California | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 |
| Chiapas | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Chihuahua | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| Ciudad de México | 15.0 | 45.0 | 22.0 | 30.0 | 33.0 | 145.0 |
As most of the data jobs concentrates in few locations, the top 5 were used to create a new dataframe to be used to draw the bar plots.
# Top 5 locations
top_locations = 5
jobs_per_location_bar_plot_df = (jobs_per_location_df.sort_values('Total', ascending = False)[:top_locations]
.sort_values('Total', ascending = True).drop(columns = 'Total')
)
jobs_per_location_bar_plot_df.head()
| Job | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist |
|---|---|---|---|---|---|
| Location | |||||
| Querétaro | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| Estado de México | 2.0 | 9.0 | 0.0 | 1.0 | 2.0 |
| Jalisco | 2.0 | 8.0 | 4.0 | 3.0 | 6.0 |
| Nuevo León | 5.0 | 13.0 | 5.0 | 3.0 | 4.0 |
| Ciudad de México | 15.0 | 45.0 | 22.0 | 30.0 | 33.0 |
# Bar plot of Jobs by State with Matplotlib
# Stack bar chart of Data jobs per State
jobs_per_location_bar_plot_df.plot.barh(stacked=True,
y = ['Data Analyst', 'Business Analyst', 'Data Engineer',
'Data Scientist', 'Data Architect'],
color=sns.color_palette('Blues_r'),
figsize=(8, 4),
width = 0.85 )
plt.legend(facecolor = 'white', loc = 'center right', title = 'Data Job Category')
plt.xlabel('Vacancies', weight = 'bold')
plt.ylabel('Location', weight = 'bold')
plt.title('Demand Per Data Job Category In the Most Important Locations\n')
plt.savefig('Figures/Fig3_DemandPerDataJobCategoryInTheMostImportantLocationsMPL.png', bbox_inches = 'tight')
plt.show()
# Bar plot of Jobs by State with Plotly
fig = px.bar(jobs_per_location_bar_plot_df, x=['Business Analyst', 'Data Analyst',
'Data Architect', 'Data Engineer', 'Data Scientist'],
y=jobs_per_location_bar_plot_df.index,
color_discrete_sequence=px.colors.sequential.Blues_r,
height = 500,
width = 800,
title = 'Demand Per Data Job Category In the Most Important Locations',
labels = {"value": "Vacancies", "variable": "Data Job Category"},
barmode='stack'#'group'
)
fig.update_layout(title_x=0.5, font=font_px)
fig.write_image("Figures/Fig3_DemandPerDataJobCategoryInTheMostImportantLocationsPX.png", scale=2)
fig.write_image("Figures/Fig3_DemandPerDataJobCategoryInTheMostImportantLocationsPX.svg", scale=2)
fig.show(config=config)
To draw the heatmaps, a new dataframe was built using the previous dataframe joined with the dataframe with the states data. This, with the purpose to draw the heapmap with all the states from Mexico, and not only with those with current vacancies.
# Building Dataframe for Heatmap
jobs_per_location_heatmap_df = (jobs_per_location_df.join(states_df, how = 'outer').fillna(0)
.drop(columns = ['Total','ID', 'Lat', 'Long'])
)
jobs_per_location_heatmap_df.head()
| Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | |
|---|---|---|---|---|---|
| Aguascalientes | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Baja California | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| Baja California Sur | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Campeche | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Chiapas | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
# Heatmap of Data Jobs per State with Matplotlib
plt.figure(figsize = (6,10))
ax = sns.heatmap(jobs_per_location_heatmap_df,
cmap = "Blues_r",
mask = (jobs_per_location_heatmap_df == 0))
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Location')
plt.xticks(rotation = 330)
plt.title('Demand Per Location & Data Job Category\n')
plt.savefig('Figures/Fig4_DemandPerLocationAndDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
Heatmaps with Plotly require the dataset to be in a tidy (or long) form. So, the data was transformed once again using the method melt from Pandas.
# Converting the Dataset into tidy data
jobs_per_location_heatmap_tidy_df = pd.melt(jobs_per_location_heatmap_df.reset_index().rename(columns = {'index': 'Location'}),
id_vars= 'Location', var_name = 'Job', value_name = 'Vacancies').\
sort_values(by = 'Location', ascending = False)
jobs_per_location_heatmap_tidy_df.head(10)
| Location | Job | Vacancies | |
|---|---|---|---|
| 159 | Zacatecas | Data Scientist | 0.0 |
| 63 | Zacatecas | Data Analyst | 0.0 |
| 95 | Zacatecas | Data Architect | 0.0 |
| 127 | Zacatecas | Data Engineer | 0.0 |
| 31 | Zacatecas | Business Analyst | 0.0 |
| 30 | Yucatán | Business Analyst | 0.0 |
| 158 | Yucatán | Data Scientist | 0.0 |
| 94 | Yucatán | Data Architect | 0.0 |
| 126 | Yucatán | Data Engineer | 1.0 |
| 62 | Yucatán | Data Analyst | 1.0 |
# Heatmap of Data Jobs per State with Plotly
# Heatmap with Plotly
fig = px.density_heatmap(jobs_per_location_heatmap_tidy_df, y='Location',
x = 'Job', z = 'Vacancies',
color_continuous_scale= heatmap_px_colorscale,
#color_continuous_scale="Blues_r",
height=800,
width=700,
title= 'Demand Per Location & Data Job Category',
labels={"Job": "Data Job Category"},
)
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="Vacancies"),
font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray= ['Business Analyst', 'Data Analyst',
'Data Architect', 'Data Engineer', 'Data Scientist'])
fig.write_image("Figures/Fig4_DemandPerLocationAndDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig4_DemandPerLocationAndDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
From the plots above, it is possible to observe that, undoubtedly, Mexico City, Nuevo León, Jalisco, and Estado de México are the locations where most of the data jobs are demanded, while the rest of the country is lagging behind in terms of data jobs creation.
However, it is important to remark that the data jobs demand is largely concentrated in Mexico City, whereas Nuevo León, Jalisco, and Estado de México are distant second, third and fourth places, respectively.
Moreover, it is noteworthy that the Data Analyst position is the one most demanded across the Mexican States; whereas Data Architect is the less demanded.
To answer this question, a treemap was drawn.
So, first, a new dataframe was created using the groupby and count methods from Pandas. Then, the vacancies with Not Disclosed positions were removed. After that, the datarame was sliced to get only the top 15 companies demanding data jobs.
Finally, the labels for the treemap were also sliced to shorten them.
# Building the Dataframe with the top 15 companies demanding data jobs
top_companies = 15
top_companies_df = (df.groupby(by = 'Company', as_index= False)['Job'].count()
.sort_values(by = 'Job', ascending = False)
.rename(columns = {'Job': 'Vacancies'})[:top_companies]
.set_index('Company').drop('Not Disclosed').reset_index()
)
top_companies_df['Company'] = top_companies_df['Company'].apply(lambda x: x[:16])
top_companies_df.head()
| Company | Vacancies | |
|---|---|---|
| 0 | Manpower, S.A. D | 10 |
| 1 | Softtek | 6 |
| 2 | Atento Servicios | 6 |
| 3 | Praxis | 6 |
| 4 | Bairesdev Llc / | 5 |
# Treemap with Plotly
fig = px.treemap(top_companies_df, path = [px.Constant("."), 'Company'], values='Vacancies',
color = 'Vacancies',
color_continuous_scale=px.colors.sequential.Blues,
title= f'Top {top_companies} Companies Demanding Data Jobs',
height= 600,
width = 1000
)
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="Vacancies"), font=font_px)
fig.write_image("Figures/Fig5_Top15CompaniesDemandingDataJobsPX.png", scale=2)
fig.write_image("Figures/Fig5_Top15CompaniesDemandingDataJobsPX.svg", scale=2)
fig.show(config=config)
From the plot above, it is clear that Manpower, Atento Servicios, Praxis and Softek are the companies with the highest data jobs demand at the moment of the present study. So, even though, the data jobs demand may vary along time, the current interest of such companies in data science and analytics might convert them in interesting prospects for job seekers.
It is also noteworthy that most of the top companies demanding data jobs are recruitment agencies and tech consultancy firms, with the notable exception of BBVA Bancomer, Liverpool and HP.
To answer this question, a heatmap was drawn in order to show what data jobs categories are demanded by company.
As usual, a dataframe with the data for top 30 companies demanding data jobs was prepared using the method pivot_table from Pandas.
Of course, the vacancies without a disclosed company were removed from the dataset.
# Dataframe for Heatmap with top 30 companies demanding data jobs
top_companies_data_jobs = 30
jobs_per_company_df = (pd.pivot_table(data = df, index = 'Company', columns = 'Job',
values = 'Location', aggfunc = 'count')
.fillna(0).reset_index())
jobs_per_company_df['Total'] = jobs_per_company_df.sum(axis=1, numeric_only= True)
jobs_per_company_df = (jobs_per_company_df.sort_values('Total', ascending = False)
.set_index('Company').drop('Not Disclosed')[:top_companies_data_jobs]
.drop(columns = 'Total').sort_values('Company', ascending = True)
)
jobs_per_company_df.head()
| Job | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist |
|---|---|---|---|---|---|
| Company | |||||
| Accenture, S.C. | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| Alcon | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| Alta Rentabilidad Sa D... | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| Atento Servicios, S. A... | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 |
| Bairesdev | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
# Heatmap with Matplotlib
plt.figure(figsize = (6,10))
formatter = mpl.ticker.StrMethodFormatter('{x:,.0f}')
ax = sns.heatmap(jobs_per_company_df,
cmap = "Blues_r",
mask = (jobs_per_company_df == 0),
cbar_kws={"format": formatter})
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel(f'Company')
plt.xticks(rotation = 330)
plt.title(f'Demand Per Company (Top {top_companies_data_jobs}) & Data Job Category\n')
plt.savefig('Figures/Fig6_DemandPerCompanyAndDataJobCategoryTop30MPL.png', bbox_inches = 'tight')
plt.show()
Likewise, as heatmaps with Plotly require the dataset to be in a tidy (or long) form. The data was transformed once again using the method melt from Pandas.
# Dataframe for Heatmap with top 30 companies demanding data jobs in tidy format
jobs_per_company_tidy_df = pd.melt(jobs_per_company_df.reset_index().sort_values('Company', ascending = False),
id_vars = 'Company', var_name = 'Job', value_name = 'Vacancies')
jobs_per_company_tidy_df.head()
| Company | Job | Vacancies | |
|---|---|---|---|
| 0 | Ust Global | Business Analyst | 0.0 |
| 1 | Tecnológico De Monterrey | Business Analyst | 0.0 |
| 2 | Stefanini México, S.A.... | Business Analyst | 0.0 |
| 3 | Softtek | Business Analyst | 0.0 |
| 4 | Sky | Business Analyst | 0.0 |
# Heatmap with Plotly
fig = px.density_heatmap(jobs_per_company_tidy_df, y='Company',
x = 'Job', z = 'Vacancies',
color_continuous_scale= heatmap_px_colorscale,
height=800,
width=750,
title= f'Demand Per Company (Top {top_companies_data_jobs}) & Data Job Category ',
labels={"Job": "Data Job Category"},
)
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="Vacancies"),
font=font_px)
fig.write_image("Figures/Fig6_DemandPerCompanyAndDataJobCategoryTop30PX.png", scale=2)
fig.write_image("Figures/Fig6_DemandPerCompanyAndDataJobCategoryTop30PX.svg", scale=2)
fig.show(config=config)
From the plots above, it is possible to see that, indeed, Data Analyst, Data Engineer and Data Scientist positions are the ones most demanded by the companies in Mexico at the moment of this study; whereas Data Architect and Business Analyst positions are the less demanded.
Notwithstanding with the above, the heatmap suggests that Business Analyst and Data Engineer positions are more demanded across different organizations. On the contrary, Data Scientist and, certainly, Data Architect vacancies can be found in more specific organizations like tech consulting companies.
To answer this question, a heatmap was drawn in order to show the location(s) of the companies demanding data jobs.
Likewise, a dataframe with the data for top 30 companies demanding data jobs was prepared by using the method pivot_table from Pandas. The vacancies without a disclosed company were removed from the dataset.
# Dataframe for analyzing the location of the companies demanding data jobs
companies_by_location_df = (pd.pivot_table(data = df, index = 'Company', columns = 'Location',
values = 'Job', aggfunc = 'count').transpose()
.join(states_df, how = 'outer').drop(columns =['ID', 'Lat', 'Long'])
.transpose().fillna(0).drop('Not Disclosed')
)
companies_by_location_df['Total'] = companies_by_location_df.sum(axis=1, numeric_only= True)
companies_by_location_df.head()
| Aguascalientes | Baja California | Baja California Sur | Campeche | Chiapas | Chihuahua | Ciudad de México | Coahuila | Colima | Durango | ... | San Luis Potosí | Sinaloa | Sonora | Tabasco | Tamaulipas | Tlaxcala | Veracruz | Yucatán | Zacatecas | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accenture International | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Accenture, S.C. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 |
| Aceros Levinson S.A. D... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Adecco | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| Ait Vanguardia Tecnoló... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 34 columns
# All companies and its locations
all_companies_by_location_df = (companies_by_location_df.sort_values('Total', ascending = False)
.drop(columns = 'Total').reset_index().rename(columns = {'index': 'Company'})
.sort_values('Company', ascending = True).set_index('Company').astype(float)
.transpose().sort_index()
)
all_companies_by_location_df.head()
| Company | Accenture International | Accenture, S.C. | Aceros Levinson S.A. D... | Adecco | Ait Vanguardia Tecnoló... | Alcon | Alta Rentabilidad Sa D... | Ami International | Atento Servicios, S. A... | Axented | ... | Universidad Mexicana | Universidad Tecnologic... | Ust Global | Visa | Von Der Heide | Wolters Kluwer | Workable Ats | Zegovia Rh Sa De Cv | Zemsania México | Zurich Aseguradora Mex... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aguascalientes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Baja California | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Baja California Sur | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Campeche | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Chiapas | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 147 columns
# Top 30 companies demanding data jobs and its locations
top_companies_by_location = 30
top_companies_by_location_df = (companies_by_location_df
.sort_values('Total', ascending = False)[:top_companies_by_location]
.drop(columns = 'Total').reset_index()
.rename(columns = {'index': 'Company'})
.sort_values('Company', ascending = True)
.set_index('Company').astype(float)
)
top_companies_by_location_df.head()
| Aguascalientes | Baja California | Baja California Sur | Campeche | Chiapas | Chihuahua | Ciudad de México | Coahuila | Colima | Durango | ... | Quintana Roo | San Luis Potosí | Sinaloa | Sonora | Tabasco | Tamaulipas | Tlaxcala | Veracruz | Yucatán | Zacatecas | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Company | |||||||||||||||||||||
| Accenture, S.C. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Adecco | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Alcon | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Alta Rentabilidad Sa D... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Atento Servicios, S. A... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 33 columns
# Heatmap of the top 30 companies demanding data jobs and their locations in Matplolib
plt.figure(figsize = (12,9))
ax = sns.heatmap(top_companies_by_location_df,
cmap = "Blues_r",
mask = (top_companies_by_location_df == 0))
ax.set_facecolor('xkcd:black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Location')
plt.ylabel(f'Company')
plt.title(f'Demand Per Company (Top {top_companies_by_location}) & Location\n')
plt.savefig('Figures/Fig7_DemandPerCompanyAndLocationTop30MPL.png', bbox_inches = 'tight')
plt.show()
Then, to use Plotly, the dataframe was converted into a tidy (or long) format.
# Converting top 30 companies and their locations dataframe into tidy data
top_companies_by_location_tidy_df = pd.melt(top_companies_by_location_df.reset_index().sort_values(by = 'Company', ascending = False),
id_vars = 'Company', var_name = 'Location', value_name = 'Vacancies')
top_companies_by_location_tidy_df.head()
| Company | Location | Vacancies | |
|---|---|---|---|
| 0 | Ust Global | Aguascalientes | 0.0 |
| 1 | Stefanini México, S.A.... | Aguascalientes | 0.0 |
| 2 | Softtek | Aguascalientes | 1.0 |
| 3 | Sky | Aguascalientes | 0.0 |
| 4 | Resources Connection M... | Aguascalientes | 0.0 |
# Heatmap of the top 30 companies demanding data jobs and their locations in Plotly
fig = px.density_heatmap(top_companies_by_location_tidy_df, y='Company',
x = 'Location', z = 'Vacancies',
color_continuous_scale= heatmap_px_colorscale,
height=800,
width=1100,
title= f'Demand Per Company (Top {top_companies_by_location}) & Location'
)
fig.update_layout( title_x=0.5,
coloraxis_colorbar=dict(title="Vacancies"), font=font_px)
fig.update_xaxes(tickangle = 270)
fig.write_image("Figures/Fig7_DemandPerCompanyAndLocationTop30PX.png", scale=2)
fig.write_image("Figures/Fig7_DemandPerCompanyAndLocationTop30PX.svg", scale=2)
fig.show(config=config)
As expectable, most of the companies locate in Mexico City as the large majority of the vacancies are offered there. However, the heatmap shows that there are some organizations that are spread across several Mexican states such as Praxis or Softek.
Futhermore, there are few well-known companies whose data jobs demand is not located in the capital region, such as HP which is located in Jalisco.
Let's take a look to the location of all companies demanding data jobs in February 2023.
# Heatmap of all companies demanding data jobs and their locations with Matplotlib
plt.figure(figsize = (12,9))
ax = sns.heatmap(all_companies_by_location_df,
cmap = "Blues_r",
mask = (all_companies_by_location_df == 0))
ax.set_facecolor('xkcd:black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Companies Demanding Data Jobs')
plt.ylabel('Location')
plt.xticks([])
plt.title('Demand Per Company & Location\n')
plt.savefig('Figures/Fig8_DemandPerCompanyAndLocationMPL.png', bbox_inches = 'tight')
plt.show()
# Converting all companies and their locations dataframe into tidy data
all_companies_by_location_tidy_df = (all_companies_by_location_df.reset_index()
.rename(columns = {'index': 'Location'})
.set_index('Location')
.transpose().reset_index()
.sort_values('Company', ascending = True)
)
all_companies_by_location_tidy_df = pd.melt(all_companies_by_location_tidy_df,
id_vars = 'Company', var_name= 'Location',
value_name = 'Vacancies').sort_values(by='Location', ascending = False)
all_companies_by_location_tidy_df.head()
| Company | Location | Vacancies | |
|---|---|---|---|
| 4850 | Zurich Aseguradora Mex... | Zacatecas | 0.0 |
| 4758 | G. De Villa Y Asociados | Zacatecas | 0.0 |
| 4756 | Fusemachines | Zacatecas | 0.0 |
| 4755 | Fundación Rafael Donde... | Zacatecas | 0.0 |
| 4754 | Farmacia San Pablo | Zacatecas | 0.0 |
# Heatmap of all companies demanding data jobs and their locations with Plotly
fig = px.density_heatmap(all_companies_by_location_tidy_df, y='Location',
x = 'Company', z = 'Vacancies',
color_continuous_scale= heatmap_px_colorscale,
height=800,
width=1000,
title= 'Demand Per Company & Location',
)
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="Vacancies"), font=font_px)
fig.update_xaxes(showticklabels=False)
fig.write_image("Figures/Fig8_DemandPerCompanyAndLocationPX.png", scale=2)
fig.write_image("Figures/Fig8_DemandPerCompanyAndLocationPX.svg", scale=2)
fig.show(config=config)
Likewise, the above heatmap beautifully shows that most of the data jobs are concentrated in Ciudad de México and, to a lesser extent, in Nuevo León and Jalisco.
It is also important to bear in mind that, from now on, the second dataset with only the vacancies with disclosed salary information: _salarydf was used for analysis.
To answer this question, a bar plot was drawn. And, to draw such a figure, the corresponding dataset was prepared by using the methods groupby and count from Pandas.
# Count of observations per Data Job Category
obs_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].count()
.rename(columns = {'Avg Salary': 'Observations'})
.sort_values('Observations', ascending = True)
)
obs_per_job_df.head()
| Observations | |
|---|---|
| Job | |
| Data Architect | 8 |
| Business Analyst | 10 |
| Data Scientist | 16 |
| Data Engineer | 21 |
| Data Analyst | 50 |
# Bar plot of the salary observations per data job category in Matplotlib
obs_per_job_plt = obs_per_job_df.plot.barh(color = sns.color_palette('Blues_r')[0],
figsize = (7, 4), alpha = 0.6, width = 0.9)
plt.legend([])
plt.xlabel('Observations with Disclosed Salary')
plt.ylabel('Data Job Category')
plt.title('Number of Salary Observations Per Data Job Category\n')
plt.savefig('Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Bar plot of the salary observations per data job category in Plotly
# Defining bar colors
colors = ['#84BDEC',]*5
colors[0] = px.colors.sequential.Blues[8]
# Bar plot
fig = px.bar(obs_per_job_df.reset_index(), x='Observations', y='Job',
color = 'Observations',
color_continuous_scale=px.colors.sequential.Blues,
title= 'Number of Salary Observations Per Data Job Category',
labels={"Job": "Data Job Category"},
height = 450,
width = 700
)
fig.update_layout(title_x=0.5, font=font_px)
fig.update_traces(marker_color=colors, marker_line_color=px.colors.sequential.Blues[8],
marker_line_width=1, opacity=0.7)
fig.write_image("Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py:2449: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
In view of the plots above, it is necessary to be careful with the salary results for the Data Architect positions, as less than 10 observations were collected.
To answer this question, bar charts and boxplots were drawn based on the salaries for each data job category.
# Dataframe with the Avg Monthly Salary per Data Job Category
salary_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].mean()
.reset_index()
.sort_values('Avg Salary', ascending = False)
)
salary_per_job_df.head()
| Job | Avg Salary | |
|---|---|---|
| 2 | Data Architect | 86562.500000 |
| 3 | Data Engineer | 49567.047619 |
| 4 | Data Scientist | 39687.500000 |
| 0 | Business Analyst | 32415.000000 |
| 1 | Data Analyst | 21240.190000 |
# Average Salary per Data Job Category with Matplotlib
plt.figure(figsize = (7, 5))
ax = sns.barplot(data = salary_per_job_df, x = 'Avg Salary', y = 'Job',
order = ['Data Architect', 'Data Engineer', 'Data Scientist', 'Business Analyst', 'Data Analyst'],
palette = ['navy', 'cornflowerblue', 'cornflowerblue', 'cornflowerblue', 'cornflowerblue'],
alpha = 0.7, edgecolor='navy')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.legend([])
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Data Job Category')
plt.grid({'both'})
plt.title('Average Salary Per Data Job Category\n')
plt.savefig('Figures/Fig10_AverageSalaryPerDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Average Salary per Data Job Category with Plotly
colors = ['#84BDEC',]*5
colors[0] = px.colors.sequential.Blues[8]
order = ['Data Analyst', 'Business Analyst', 'Data Scientist', 'Data Engineer', 'Data Architect']
fig = px.bar(salary_per_job_df, x='Avg Salary', y='Job',
color = 'Avg Salary',
color_continuous_scale=px.colors.sequential.Blues,
title= 'Average Salary Per Data Job Category',
labels={"Job": "Data Job Category", 'Avg Salary': "Average Monthly Salary (MXN)"},
height = 500,
width = 700,
opacity=0.8
)
fig.update_layout(yaxis={'categoryorder':'array', 'categoryarray': order}, title_x=0.5, font=font_px)
fig.update_traces(marker_color=colors, marker_line_color=px.colors.sequential.Blues[8], marker_line_width=1)
fig.update_xaxes(tickformat = '$,~s')
fig.write_image("Figures/Fig10_AverageSalaryPerDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig10_AverageSalaryPerDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py:2449: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
From the bar charts above, it is clear that Data Architect positions are the data job category with the highest monthly average salary with about $86,500 MXN.
Furthermore, the second place corresponds to the Data Engineer positions with an average monthly salary of about $49,500 MXN.
The rest of the data job positions have the following average monthly salaries:
Data Scientist: $39,687
Business Analyst: $32,415
Data Analyst: $21,240
However, the calculated monthly average salaries might be affected by outliers. So, the monthly median salaries were estimated with the help of the boxplots below.
# Box plot for each Data Job Category with Matplotlib
plt.figure(figsize = (8,5))
ax = sns.boxplot(data = salary_df, x = 'Job', y = 'Avg Salary',
order = ['Data Architect', 'Data Engineer', 'Data Scientist', 'Business Analyst', 'Data Analyst'],
palette = 'Blues_r')
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xlabel('Data Job Category')
plt.ylabel('Average Monthly Salary (MXN)')
plt.title('Salary Per Data Job Category\n')
plt.savefig('Figures/Fig11_SalaryPerDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Box plot for each Data Job Category with Plotly
fig = px.box(salary_df, x = "Job", y = "Avg Salary",
color = "Job", points="all",
color_discrete_sequence=px.colors.sequential.Blues_r,
category_orders={"Job": ['Data Architect', 'Data Engineer', 'Data Scientist', 'Business Analyst', 'Data Analyst']},
labels={
"Avg Salary": "Average Monthly Salary (MXN)",
"Job": "Data Job Category"},
title='Salary Per Data Job Category',
height=550,
width=750
)
fig.update_layout(title_x=0.5, font=font_px)
fig.update_traces(showlegend=False)
fig.update_yaxes(tickformat = '$,~s')
fig.write_image("Figures/Fig11_SalaryPerDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig11_SalaryPerDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
The boxplot suggests that, after removing outliers from the calculation of the salaries distributions, Data Architect positions are still the ones with the highest salaries in the current Mexican labor market with a monthly median salary of about 75,000 MXN.
The second monthly highest median salary correspond to Data Engineer positions with about 50,000 MXN.
On the other hand, it was found that Data Scientist and Business Analyst positions have a median monthly salary of about 39,750 MXN and about 32,000 MXN, respectively.
Finally, Data Analyst positions are the ones with the lowest salaries in the Data Jobs labor market with only a median monthly salary of about 16,750 MXN.
Please refer to the section 5. Statistical Analysis below of the present notebook for a set of statistical tests applied to the salaries of each data job category.
To answer this question, a heatmap was drawn to show the relationship among location, data job category and average salary.
To do so, the appropriate dataframe was build using the methods pivot_table and join from Pandas. In this sense, the pivot salary table by location was joined with the states dataframe in order to display all the states within Mexico, and not only those with vacancies. Moreover, not disclosed locations were removed.
Finally, heatmaps were drawn using Matplotlib and Plotly.
# Dataframe with averages salaries per location and data job category
salary_location_df = (pd.pivot_table(data = salary_df, index = 'Location',
columns = 'Job', values = 'Avg Salary',
aggfunc= 'mean')
.join(states_df, how = 'outer').drop(columns =['ID', 'Lat', 'Long'])
.fillna(0).sort_index(ascending= True).drop('Not Disclosed')
)
salary_location_df.head()
| Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | |
|---|---|---|---|---|---|
| Aguascalientes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Baja California | 0.0 | 31000.0 | 0.0 | 0.0 | 0.0 |
| Baja California Sur | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Campeche | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Chiapas | 0.0 | 12000.0 | 0.0 | 0.0 | 0.0 |
# Heatmap of the Salaries per Data Job category and Location with Matplotlib
plt.figure(figsize = (7,8))
formatter = mpl.ticker.StrMethodFormatter('${x:,.0f}')
ax = sns.heatmap(salary_location_df,
cmap = "Blues_r",
mask = (salary_location_df == 0),
cbar_kws={"format": formatter})
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Data Job Category')
plt.ylabel('Location')
plt.title(f'Salary Per Location And Data Job Category\n')
plt.xticks(rotation = 330)
plt.savefig('Figures/Fig12_SalaryPerLocationAndDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Transforming dataframe into tidy format
salary_location_tidy_df = salary_location_df.rename_axis('Location').reset_index()
salary_location_tidy_df = pd.melt(salary_location_tidy_df, id_vars= 'Location', var_name = 'Job', value_name = 'Salary')
salary_location_tidy_df = salary_location_tidy_df.sort_values('Location', ascending = False)
salary_location_tidy_df.head()
| Location | Job | Salary | |
|---|---|---|---|
| 159 | Zacatecas | Data Scientist | 0.0 |
| 63 | Zacatecas | Data Analyst | 0.0 |
| 95 | Zacatecas | Data Architect | 0.0 |
| 127 | Zacatecas | Data Engineer | 0.0 |
| 31 | Zacatecas | Business Analyst | 0.0 |
# Heatmap of the Salaries per Data Job category and Location with Plotly
fig = px.density_heatmap(salary_location_tidy_df, y='Location', x = 'Job', z = 'Salary',
histfunc="avg",
color_continuous_scale=heatmap_px_colorscale,
#color_continuous_scale="Blues",
height=800,
width=750,
title='Salary Per Location And Data Job Category',
labels={
'Job': 'Data Job Category'
}
)
fig.update_layout(title_x=0.5, font=font_px, coloraxis_colorbar=dict(title="Mth. Salary (MXN)"))
fig.update_traces(colorbar_title_text='Monthly Salary (MXN)')
fig.update_coloraxes(colorbar_tickformat = '$,~s')
fig.update_xaxes(categoryorder='array', categoryarray= ['Business Analyst', 'Data Analyst',
'Data Architect', 'Data Engineer', 'Data Scientist'])
fig.write_image("Figures/Fig12_SalaryPerLocationAndDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig12_SalaryPerLocationAndDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
In general, the above plot suggests that the highest salaries can be found in Mexico City, Nuevo León, Jalisco and Sinaloa. However the observation for the latter state is atypical and should be interpreted with caution.
More specifically, the highest salaries for the different data jobs categories can be found in the following states:
| Data Job Category | Locations with the Highest Average Salaries |
|---|---|
| Business Analyst | Ciudad de México |
| Data Analyst | Chihuahua |
| Data Architect | Ciudad de México |
| Data Engineer | Jalisco |
| Data Scientist | Sinaloa |
To answer this question, a lollipop chart was drawn using the top average salaries per company by using the grouby method from Pandas.
# Dataframe with the top averages salaries per company
top_salary_company = 20
salary_company_df = (salary_df.groupby('Company')[['Avg Salary']].mean()
.drop('Not Disclosed')
.sort_values('Avg Salary', ascending = False)[:top_salary_company]
)
salary_company_df.head()
| Avg Salary | |
|---|---|
| Company | |
| Especialistas En Talen... | 87500.0 |
| Resources Connection M... | 75000.0 |
| Reclutamiento En Tecno... | 75000.0 |
| Zemsania México | 75000.0 |
| Zegovia Rh Sa De Cv | 75000.0 |
# Companies paying the highest salaries with Matplotlib
plt.figure(figsize = (8,7))
ax = sns.scatterplot(data = salary_company_df,
x = 'Avg Salary',
y = 'Company',
color = 'royalblue'
)
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
for index, value in salary_company_df.sort_values('Avg Salary')['Avg Salary'].iteritems():
plt.hlines(xmax = value, xmin = np.min(salary_company_df['Avg Salary']), y = index, color='royalblue') # Stems
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel(f'Company')
plt.xticks(rotation = 0)
plt.title(f'Top {top_salary_company} Companies Paying The Highest Salaries\n')
plt.savefig('Figures/Fig13_Top20SalaryPerCompanyMPL.png', bbox_inches = 'tight')
plt.show()
# Companies paying the highest salaries with Plotly
fig = px.scatter(salary_company_df.reset_index().sort_values('Avg Salary'),
x='Avg Salary', y='Company',
height = 700,
width = 800,
title = f'Top {top_salary_company} Companies Paying The Highest Salaries',
labels = {'Avg Salary':'Average Monthly Salary (MXN)'},
)
fig.update_layout(title_x=0.5, font=font_px)
fig.update_traces(marker_color=px.colors.sequential.Blues[7], marker_line_color=px.colors.sequential.Blues[7])
fig.update_xaxes(tickformat = '$,~s')
for index, value in salary_company_df.sort_values('Avg Salary')['Avg Salary'].iteritems():
fig.add_shape(type='line', y0=index, y1= index, x0=np.min(salary_company_df['Avg Salary']), x1= value, xref='x',
yref='y', line=dict(color= px.colors.sequential.Blues[7]))
fig.write_image("Figures/Fig13_Top20SalaryPerCompanyPX.png", scale=2)
fig.write_image("Figures/Fig13_Top20SalaryPerCompanyPX.svg", scale=2)
fig.show(config=config)
From the plot aboves, the companies offering the highest salaries are Especialistas en Talento, Zemsania México, Resources Connection México, Zegovia RH, and Reclutamiento en Tecnología. From those, 3 are recruiting agencies and 2 are tech consulting firms.
To answer this question, a heatmap was drawn to show the relationship among company, data job category and average salary.
To do so, the appropriate dataframe was build using the methods pivot_table and join from Pandas. Moreover, not disclosed employers were removed.
Finally, heatmaps were drawn using Matplotlib and Plotly.
# Dataframe with averages salaries per company and data job category
top_salary_company_per_data_job = 30
salary_company_per_data_job_df = (pd.pivot_table(data = salary_df, index = 'Company',
columns = 'Job', values = 'Avg Salary',
aggfunc= 'mean')
.fillna(0).sort_index(ascending= True).drop('Not Disclosed')
)
salary_company_per_data_job_df['Total Average'] = (salary_company_per_data_job_df[salary_company_per_data_job_df > 0]
.mean(axis=1, numeric_only= True))
salary_company_per_data_job_df = (salary_company_per_data_job_df
.sort_values('Total Average', ascending = False)[:top_salary_company_per_data_job]
.sort_index(ascending= True)
)
salary_company_per_data_job_df.head()
| Job | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | Total Average |
|---|---|---|---|---|---|---|
| Company | ||||||
| Ait Vanguardia Tecnoló... | 0.0 | 0.0 | 0.0 | 67500.0 | 0.0 | 67500.0 |
| Cognodata México, S. D... | 0.0 | 0.0 | 0.0 | 0.0 | 46000.0 | 46000.0 |
| Compartamos Servicios,... | 0.0 | 0.0 | 0.0 | 0.0 | 50000.0 | 50000.0 |
| Desarrollo En Competen... | 0.0 | 0.0 | 0.0 | 0.0 | 39500.0 | 39500.0 |
| Ecosistemex S. De R.L.... | 0.0 | 0.0 | 0.0 | 45000.0 | 0.0 | 45000.0 |
# Heatmap of the Companies Offering the Highest Salaries Per Data Job Category with Matplolib
plt.figure(figsize = (7,10))
formatter = mpl.ticker.StrMethodFormatter('${x:,.0f}')
ax = sns.heatmap(data = salary_company_per_data_job_df.drop(columns = 'Total Average'),
cmap = 'Blues_r',
mask = (salary_company_per_data_job_df.drop(columns = 'Total Average') == 0),
cbar_kws={"format": formatter}
)
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Data Job Category')
plt.ylabel('Company')
plt.title(f'Salary Per Company (Top {top_salary_company_per_data_job}) And Data Job Category\n')
plt.xticks(rotation = 330)
plt.savefig('Figures/Fig14_SalaryPerCompanyAndDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Converting dataframe into a tidy format
salary_company_per_data_job_tidy_df = (salary_company_per_data_job_df.copy().drop(columns = 'Total Average')
.reset_index()
)
salary_company_per_data_job_tidy_df = pd.melt(salary_company_per_data_job_tidy_df,
id_vars = 'Company', var_name = 'Job', value_name = 'Salary')
salary_company_per_data_job_tidy_df.head()
| Company | Job | Salary | |
|---|---|---|---|
| 0 | Ait Vanguardia Tecnoló... | Business Analyst | 0.0 |
| 1 | Cognodata México, S. D... | Business Analyst | 0.0 |
| 2 | Compartamos Servicios,... | Business Analyst | 0.0 |
| 3 | Desarrollo En Competen... | Business Analyst | 0.0 |
| 4 | Ecosistemex S. De R.L.... | Business Analyst | 0.0 |
# Heatmap of the Companies Offering the Highest Salaries Per Data Job Category with Plotly
fig = px.density_heatmap(salary_company_per_data_job_tidy_df.sort_values(by = 'Company', ascending = False),
y='Company', x = 'Job', z = 'Salary',
histfunc="avg",
color_continuous_scale=heatmap_px_colorscale,
#color_continuous_scale="Blues",
height=900,
width=850,
title= f'Salary Per Company (Top {top_salary_company_per_data_job}) And Data Job Category',
labels={"Job": "Data Job Category"}
)
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="Mth. Salary (MXN)"), font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray= ['Business Analyst', 'Data Analyst', 'Data Architect', 'Data Engineer', 'Data Scientist'])
fig.update_coloraxes(colorbar_tickformat = '$,~s')
fig.write_image("Figures/Fig14_SalaryPerCompanyAndDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig14_SalaryPerCompanyAndDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
For Business Analyst positions, the organizations offering higher salaries are Kelly Services and Manpower.
Moreover, for Data Analyst positions, the organizations offering higher salaries are Especialistas en Talento and Getecsa.
Furthermore, for Data Architect positions, the organizations offering higher salaries are Manpower and Especialistas en Talento.
In addition, for Data Engineer positions, the organizations offering higher salaries are Manpower, Ait Vanguardia Tecnológica, and Stefanini México.
Finally, for Data Scientist positions, the organizations offering higher salaries are Resources Connection and Zegovia Rh.
In this section, the D'Agostino-Pearson normality test was used to assess the normality of the data jobs salary distribution. Then, both parametric (ANOVA and t-test with unequal variance) and non-parametric (Kruskal-Wallis H and Mann-Whitney U) tests were carried out to assess the significance of the obtained mean salaries per data job category.
Finally, an effect size analysis was also carried out by computing the absolute mean salary difference, the Cohen's d, and the bootstrap confidence intervals for each data job category. This, in order to assess whether the salary differences are significant from a practical point of view.
To answer this question, a histogram was drawn with the retrieved salary data.
# Histogram with Matplolib
bins = 8
minh = np.min(salary_df['Avg Salary'])
maxh = np.max(salary_df['Avg Salary'])
plt.figure(figsize = (10, 6))
ax = sns.histplot(data = salary_df, x = 'Avg Salary',
color = sns.color_palette('Blues_r')[0],
bins = bins,
alpha = 0.6
)
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Frequency')
plt.xticks(np.linspace(minh,maxh, bins + 1))
plt.title('Data Jobs Salary Distribution\n')
plt.grid({'both'})
plt.savefig('Figures/Fig15_DataJobsSalaryDistributionMPL.png', bbox_inches = 'tight')
plt.show()
# Histogram with Plotly
size = ( maxh - minh )/bins
ticks = list(np.linspace(minh,maxh,bins + 1))
fig = px.histogram(salary_df, x = 'Avg Salary', nbins= bins, opacity=0.6,
color_discrete_sequence = [px.colors.sequential.Blues_r[0]],
title= 'Data Jobs Salary Distribution',
labels={"Avg Salary": "Average Monthly Salary (MXN)"},
height = 600,
width = 800
)
fig.update_layout(yaxis_title_text='Frequency', bargap=0.01, title_x=0.5,
font=font_px)
fig.update_traces(xbins=dict( # bins used for histogram
start=minh,
end=maxh,
size=size
))
fig.update_xaxes(tickvals=ticks, tickformat = '$,~s')
fig.write_image("Figures/Fig15_DataJobsSalaryDistributionPX.png", scale=2)
fig.write_image("Figures/Fig15_DataJobsSalaryDistributionPX.svg", scale=2)
fig.show(config=config)
Most of the Data Jobs salaries are located in the range from $7,000 MXN to $23,312 MXN per month, whereas monthly salaries superior to $39,625 MXN are more scarce.
On the other hand, it is noteworthy that the salary distributions is skewed to the right. So, it might not comply with the normality assumption.
To answer this question, a D’Agostino-Pearson test of normal distribution was carried out.
$$\mathbf{H_0}: Normal \; Distribution$$ $$\mathbf{H_1}: Not \; H_0$$$$\alpha = 0.05$$# D’Agostino-Pearson test of normal distribution
stat, pvalue = stats.normaltest(salary_df['Avg Salary'])
print(f'The K2 statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.')
The K2 statistic is: 39.029. The p-value is: 0.000.
Indeed, as the p-value is significant ($p < 0.05$), the null hypothesis that the sample comes from a normal distribution is rejected.
To answer this question, multiple histograms were drawn with the salary data for each data job category.
# Multiple histograms with Matplolib
figure, axis = plt.subplots(nrows=3, ncols=2, figsize = (13,12))
axis[0, 0].hist(salary_df.loc[salary_df['Job'] == 'Business Analyst']['Avg Salary'], color = 'cornflowerblue', edgecolor = 'navy', alpha = 0.6)
axis[0, 0].set_title("Business Analyst")
axis[0, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[0, 1].hist(salary_df.loc[salary_df['Job'] == 'Data Analyst']['Avg Salary'], color = 'cornflowerblue', edgecolor = 'navy', alpha = 0.6)
axis[0, 1].set_title("Data Analyst")
axis[0, 1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[1, 0].hist(salary_df.loc[salary_df['Job'] == 'Data Architect']['Avg Salary'], color = 'cornflowerblue', edgecolor = 'navy', alpha = 0.6)
axis[1, 0].set_title("Data Architect")
axis[1, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[1, 1].hist(salary_df.loc[salary_df['Job'] == 'Data Engineer']['Avg Salary'], color = 'cornflowerblue', edgecolor = 'navy', alpha = 0.6)
axis[1, 1].set_title("Data Engineer")
axis[1, 1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[2, 0].hist(salary_df.loc[salary_df['Job'] == 'Data Scientist']['Avg Salary'], color = 'cornflowerblue', edgecolor = 'navy', alpha = 0.6)
axis[2, 0].set_title("Data Scientist")
axis[2, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.savefig('Figures/Fig16_SalaryDistributionsPerDataJobCategoryMPL.png', bbox_inches = 'tight')
plt.show()
# Multiple histograms with Plotly
bins = 10
fig = make_subplots(
rows=3, cols=2,
subplot_titles=("Data Analyst", "Business Analyst", "Data Architect", "Data Engineer", "Data Scientist"),
specs=[[{"colspan": 2}, None],
[{}, {}],
[{}, {}]]
)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Analyst']['Avg Salary'], nbinsx= bins),
row=1, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Business Analyst']['Avg Salary'], nbinsx= bins),
row=2, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Architect']['Avg Salary'], nbinsx= bins),
row=2, col=2)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Engineer']['Avg Salary'], nbinsx= bins),
row=3, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Scientist']['Avg Salary'], nbinsx= bins),
row=3, col=2)
fig.update_layout(height=900, width=800, font=font_px,
title_text="Salary Distributions Per Data Job Category",
title_x=0.5,
bargap=0.02
)
fig.update_traces(marker_color=px.colors.sequential.Blues[8], marker_line_color=px.colors.sequential.Blues[8],
marker_line_width=1, opacity = 0.6, showlegend=False)
fig.update_xaxes(tickformat = '$,~s')
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=1, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=2, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=2, col=2)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=3, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=3, col=2)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=2, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=2, col=2)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=3, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=3, col=2)
fig.write_image("Figures/Fig16_SalaryDistributionsPerDataJobCategoryPX.png", scale=2)
fig.write_image("Figures/Fig16_SalaryDistributionsPerDataJobCategoryPX.svg", scale=2)
fig.show(config=config)
Indeed, only the salary observations for the Data Scientist and Data Engineer positions somewhat resemble a normal distribution, whereas it is not possible to state the same with the observations for the other data job categories. Thus a normal assumption cannot be hold.
Nonetheless, for the purposes of the present study, both parametric (ANOVA and T-test with unequal variance) and non-parametric (Kruskal-Wallis H and Mann-Whitney U) tests were carried out to assess the significance of the obtained results.
To answer this question, an ANOVA and a Kruskal-Wallis H test were performed on the salary data.
$$\mathbf{H_0}: \mu_{BA} = \mu_{DA} = \mu_{DR} = \mu_{DE} = \mu_{DS}$$$$\mathbf{H_1}: Not \; H_0$$$$\alpha = 0.05$$First, the salary observations for each data job category were prepared from the dataset using the pivot_table method from Pandas.
# Salary pivot table by data job category
salary_pivot_df = pd.pivot_table(salary_df, index = 'Company', columns = 'Job', values = 'Avg Salary', aggfunc= 'mean')
salary_pivot_df.head()
| Job | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist |
|---|---|---|---|---|---|
| Company | |||||
| Aceros Levinson S.A. D... | NaN | 22500.0 | NaN | NaN | NaN |
| Adecco | 39000.0 | 22500.0 | NaN | NaN | NaN |
| Ait Vanguardia Tecnoló... | NaN | NaN | NaN | 67500.0 | NaN |
| Alta Rentabilidad Sa D... | NaN | 7500.0 | NaN | NaN | NaN |
| Atento Servicios, S. A... | NaN | 8535.0 | NaN | NaN | NaN |
Then, a Numpy array was defined for each data job category and the NaN values were dropped.
# Array of salary observations by data job category
BA = salary_pivot_df['Business Analyst'].dropna().values
DA = salary_pivot_df['Data Analyst'].dropna().values
DR = salary_pivot_df['Data Architect'].dropna().values
DE = salary_pivot_df['Data Engineer'].dropna().values
DS = salary_pivot_df['Data Scientist'].dropna().values
Then, the parametric ANOVA test was carried out:
# ANOVA with outliers
stat, pvalue = stats.f_oneway(BA, DA, DR, DE, DS)
print(f'The F statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.\n')
alpha = 0.05
if pvalue > alpha:
print('Test interpretation: Fail to reject H0.')
else:
print('Test interpretation: Reject H0.')
The F statistic is: 13.963. The p-value is: 0.000. Test interpretation: Reject H0.
# ANOVA using the Statsmodels Formula API
model = smf.ols(formula = "Salary ~ C(Job)", data = salary_df.rename(columns = {'Avg Salary': 'Salary'})).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
| sum_sq | df | F | PR(>F) | |
|---|---|---|---|---|
| C(Job) | 3.555042e+10 | 4.0 | 24.72355 | 3.030677e-14 |
| Residual | 3.594793e+10 | 100.0 | NaN | NaN |
Moreover, the non-parametric Kruskal-Wallis H test was also performed:
# Kruskal-Wallis H Test
stat, pvalue = stats.kruskal(BA, DA, DR, DE, DS)
print(f'The H statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.\n')
alpha = 0.05
if pvalue > alpha:
print('Test interpretation: Fail to reject H0.')
else:
print('Test interpretation: Reject H0.')
The H statistic is: 27.973. The p-value is: 0.000. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the salary differences among the different data jobs are statistically significant.
To answer this question, the T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations for that pair of data jobs categories.
$$\mathbf{H_0}: \mu_{DR} ≤ \mu_{DS}$$$$\mathbf{H_1}: \mu_{DR} > \mu_{DS}$$$$\alpha = 0.05$$As the T-tests with unequal variance and the Mann-Whitney U tests were carried out several times to compare the mean salaries of the different data jobs categories, a function was defined as follows:
# Function to compare the mean salaries using a T-test and the Mann-Whitney U test
def compare_means(sample1, sample2, alternative, equal_var = False, alpha = 0.05):
"""
This functions performs both the T-test and the Mann-Whitney U test
to compare the means of two independent samples.
Parameters
sample1: First sample array.
sample2: Second sample array.
alternative: Alternative hypothesis. If 'greater', the mean of the distribution
underlying the first sample is greater than the mean of the distribution underlying
the second sample. If 'less', the mean of the distribution underlying the first
sample is less than the mean of the distribution underlying the second sample.
Finally, if ‘two-sided’, the means of the distributions underlying the samples are unequal.
equal_var: If True, the test performs a standard independent 2 sample test that
assumes equal population variances. If False (default), perform Welch’s t-test,
which does not assume equal population variance.
alpha: Significance level. It is 0.05 by default.
Returns
stat_t: The calculated t-statistic.
pvalue_t: The associated p-value or the chosen alternative from the t-test.
stat_u: The Mann-Whitney U statistic corresponding with the first sample.
pvalue_u: The associated p-value for the chosen alternative from the Mann-Whitney U test.
"""
# t-test
stat_t, pvalue_t = stats.ttest_ind(a = sample1, b = sample2, equal_var = equal_var, alternative = alternative)
# Mann-Whitney U test
stat_u, pvalue_u = stats.mannwhitneyu(x = sample1, y = sample2, alternative = alternative)
# Print of results
print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
print(f'Mann-Whitney U test: The U statistic is {stat_u:.03f}; and the p-value is {pvalue_u:.03f}.\n')
# Interpretation
alpha = 0.05
if pvalue_t > alpha and pvalue_u > alpha:
print('Test interpretation: Fail to reject H0.')
elif pvalue_t < alpha and pvalue_u < alpha:
print('Test interpretation: Reject H0.')
else:
print('Inconsistent results between the tests.')
return stat_t, pvalue_t, stat_u, pvalue_u
# Comparing Data Architect and Data Scientist salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DS, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.864; and the p-value is 0.013. Mann-Whitney U test: The U statistic is 74.500; and the p-value is 0.004. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for Data Scientists.
Then, the difference between the mean salaries for each data job category were calculated in order to quantify the size of the effect as well as the confidence intervals.
However, as the salary observations do not follow a Gaussian distribution, the confidence intervals for the mean were estimated using bootstrapping and the effect size was primarily calculated in absolute terms. Nonetheless, for sake of completeness, Cohen's d was also estimated.
Likewise, as the effect size was estimated several times for each data job salary comparison, several functions were defined as follows:
# Function for estimating the Absolute Effect Size
def absolute_effect_size(sample1, sample2):
"""
This functions estimates the effect size of the two mean samples by calculating the
absolute difference between them.
Parameters
sample1: First sample array.
sample2: Second sample array.
Returns
abs_effect_size: The absolute effect size of the two mean samples.
"""
abs_effect_size = abs(np.mean(sample1) - np.mean(sample2))
return abs_effect_size
# Function for estimating the Standardized Effect Size (Cohen's d)
def cohens_d(sample1, sample2):
"""
This functions estimates the standardized effect size of the two mean samples
by calculating the Cohen's d.
Parameters
sample1: First sample array.
sample2: Second sample array.
Returns
d: The standardized effect size of the two mean samples (Cohen's d).
"""
# Lenght of the samples
n1, n2 = len(sample1), len(sample2)
# Variances of the samples
s1, s2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
# Pooled standard deviation
s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
# Mean of the samples
u1, u2 = np.mean(sample1), np.mean(sample2)
# Cohen's d
d = (u1 - u2) / s
return d
# Function for estimating the bootstrap confidence intervals
def boot_conf_int(sample, alpha = 0.05, iterations = 1000):
"""
This functions estimates the bootstrap confidence intervals of a sample mean.
Parameters
sample: Sample array.
alpha: Significance level. It is 0.05 by default.
iterantions: Number of iterations. It is 1000 by default.
Returns
stats: List with the estimated statistics.
lower_ci: Lower confidence interval.
upper_ci: Upper confidence interval.
"""
n_iterations = iterations
n_size = int(len(sample) * 0.50)
stats = []
for i in range(n_iterations):
new_sample = resample(sample, n_samples=n_size)
mean_stat = np.mean(new_sample)
stats.append(mean_stat)
p = (alpha / 2) * 100
lower_ci = np.percentile(stats, p)
p = ((1 - alpha) + (alpha / 2)) * 100
upper_ci = np.percentile(stats, p)
return stats, lower_ci, upper_ci
# Function for estimating the Effect Size
def effect_size(sample1, sample2, legend1 = "Sample 1", legend2 = "Sample 2", alpha = 0.05):
"""
This functions estimates the effect size of the two mean samples by calculating the
absolute effect size, the Cohen's d, and the bootstrap confidence intervals.
Parameters
sample1: First sample array.
sample2: Second sample array.
legend1: String with the legend for the sample 1.
legend2: String with the legend for the sample 2.
alpha: Significance level. It is 0.05 by default.
Returns
None
"""
# Absolute Effect Size
abs_effect_size = absolute_effect_size(sample1, sample2)
print(f'The mean salary difference between the two data job categories is: ${abs_effect_size:,.0f}.\n')
# Standardized Effect Size (Cohen's d)
d = cohens_d(sample1, sample2)
# Interpretation of Cohen's d
print(f"The Cohen's d between the two data job categories is: {d:,.2f}.")
if d <= 0.20:
print(f"(Small Effect Size)\n")
elif d <= 0.50:
print(f"(Medium Effect Size)\n")
else:
print(f"(Large Effect Size)\n")
# Legends list
legends = [legend1, legend2]
# Adding plot
fig, ax = plt.subplots(figsize = (9, 5))
for index, sample in enumerate([sample1, sample2]):
# Bootstrap confidence intervals
stats, lower_ci, upper_ci = boot_conf_int(sample, alpha = alpha, iterations = 1000)
print(f"{legends[index]}'s Salary --> {(1 - alpha)*100}% Confidence Interval: (${lower_ci:,.0f}, ${upper_ci:,.0f})")
ax.hist(stats, color = sns.color_palette('Blues_r')[3-index],
alpha = 0.65, label = f'{legends[index]}')
print('\n')
plt.title(f'Bootstrap Distributions for {legend1} and {legend2} Salaries\n')
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Frequency')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.legend(loc = 'upper right', facecolor = 'white')
plt.savefig(f'Figures/Fig_BootstrapDistributions_{legend1}-{legend2}_MPL.png', bbox_inches = 'tight')
plt.show()
# Estimating the Effect Size of the mean salary difference
effect_size(DR, DS, "Data Architect", "Data Scientist", alpha = 0.05)
The mean salary difference between the two data job categories is: $41,875. The Cohen's d between the two data job categories is: 1.73. (Large Effect Size) Data Architect's Salary --> 95.0% Confidence Interval: ($51,667, $116,667) Data Scientist's Salary --> 95.0% Confidence Interval: ($26,925, $54,002)
So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Scientist positions is, not only statistically significant, but also practically significant.
Certainly, a difference of $41,875 MXN per month is huge in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are very different from each other.
To answer this question and the following ones, the T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the mean salary difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DE} ≤ \mu_{DS}$$$$\mathbf{H_1}: \mu_{DE} > \mu_{DS}$$$$\alpha = 0.05$$# Comparing Data Engineer and Data Scientist salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, DS, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.457; and the p-value is 0.326. Mann-Whitney U test: The U statistic is 104.500; and the p-value is 0.391. Test interpretation: Fail to reject H0.
The obtained p-values from the T-test and the Mann-Whitney U test were not significant ($p > 0.05$). Therefore, the Data Engineer salaries are not significantly higher than those for Data Scientists.
# Effect size analysis
effect_size(DE, DS, "Data Engineer", "Data Scientist", alpha = 0.05)
The mean salary difference between the two data job categories is: $3,458. The Cohen's d between the two data job categories is: 0.17. (Small Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($28,122, $58,216) Data Scientist's Salary --> 95.0% Confidence Interval: ($27,357, $54,073)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Data Scientist positions is also practically not significant.
Certainly, a difference of $3,458 MXN per month is neglectable in the Mexican labor market for technology. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions overlap in a large extent as it is clearly shown in the plot above.
# Comparing Data Scientist and Data Analyst salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.829; and the p-value is 0.005. Mann-Whitney U test: The U statistic is 361.000; and the p-value is 0.001. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Scientists salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(DS, DA, "Data Scientist", "Data Analyst", alpha = 0.05)
The mean salary difference between the two data job categories is: $16,562. The Cohen's d between the two data job categories is: 0.99. (Large Effect Size) Data Scientist's Salary --> 95.0% Confidence Interval: ($26,641, $53,718) Data Analyst's Salary --> 95.0% Confidence Interval: ($17,419, $31,315)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Scientist and Data Analyst positions is, not only statistically significant, but also practically significant.
Certainly, a difference of $16,562 MXN per month is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are different from each other as shown in the plot above.
# Comparing Data Scientist and Business Analyst salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.229; and the p-value is 0.117. Mann-Whitney U test: The U statistic is 72.000; and the p-value is 0.296. Test interpretation: Fail to reject H0.
The obtained p-values from the T-test and the Mann-Whitney U test were not significant ($p > 0.05$). Therefore, the Data Scientist salaries are not significantly higher than those for Business Analysts.
# Effect size analysis
effect_size(DS, BA, "Data Scientist", "Business Analyst", alpha = 0.05)
The mean salary difference between the two data job categories is: $7,136. The Cohen's d between the two data job categories is: 0.44. (Medium Effect Size) Data Scientist's Salary --> 95.0% Confidence Interval: ($26,000, $55,502) Business Analyst's Salary --> 95.0% Confidence Interval: ($25,844, $39,602)
In this case, even though the mean salary difference between Data Scientist and Business Analyst positions is not statistically significant, a difference of $7,136 MXN per month is not neglectable as this quantity represents about the national average salary in 2022.
Therefore, notwithstanding with their lack of statistical significance, based on the effect size analysis, it is possible to state that there is a significant practical difference between the mean salary difference of the Data Scientist and the Business Analyst positions.
# Comparing Data Architect and Data Engineer salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DE, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.610; and the p-value is 0.018. Mann-Whitney U test: The U statistic is 73.000; and the p-value is 0.006. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architects salaries are significantly higher than those for Data Engineers.
# Effect size analysis
effect_size(DR, DE, "Data Architect", "Data Engineer", alpha = 0.05)
The mean salary difference between the two data job categories is: $38,417. The Cohen's d between the two data job categories is: 1.55. (Large Effect Size) Data Architect's Salary --> 95.0% Confidence Interval: ($51,667, $116,667) Data Engineer's Salary --> 95.0% Confidence Interval: ($29,712, $58,861)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Engineer positions is, not only statistically significant, but also practically significant.
Certainly, a difference of $38,417 MXN per month is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are different from each other as shown in the plot above.
# Comparing Data Engineer and Business Analyst salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.749; and the p-value is 0.049. Mann-Whitney U test: The U statistic is 83.500; and the p-value is 0.104. Inconsistent results between the tests.
The obtained p-value from the T-test was significant ($p < 0.05$); whereas the obtained p-value from the Mann-Whitney U test was not significant ($p > 0.05$).
In this context, taking into account that the Data Engineer and Business Analysts salary observations do not follow a normal distribution, the results from the Mann-Whitney U test could be deemed as more accurate.
Therefore, all in all, the Data Engineer salaries are not significantly higher than those for Business Analyst.
# Effect size analysis
effect_size(DE, BA, "Data Engineer", "Business Analyst", alpha = 0.05)
The mean salary difference between the two data job categories is: $10,594. The Cohen's d between the two data job categories is: 0.63. (Large Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($29,096, $57,813) Business Analyst's Salary --> 95.0% Confidence Interval: ($25,750, $40,200)
In this case, even though the mean salary difference between Data Engineers and Business Analyst positions is not statistically significant, a difference of $10,594 MXN per month is practically significant as this quantity represents more than national average salary in 2022.
Therefore, notwithstanding with their lack of statistical significance, based on the effect size analysis, it is possible to state that there is a significant practical difference between the mean salary difference of the Data Engineers and the Business Analyst positions.
# Comparing Data Engineer and Data Analyst salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 3.280; and the p-value is 0.002. Mann-Whitney U test: The U statistic is 365.000; and the p-value is 0.001. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineers salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(DE, DA, "Data Engineer", "Data Analyst", alpha = 0.05)
The mean salary difference between the two data job categories is: $20,020. The Cohen's d between the two data job categories is: 1.18. (Large Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($29,498, $58,191) Data Analyst's Salary --> 95.0% Confidence Interval: ($17,125, $31,691)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Data Analyst positions is, not only statistically significant, but also practically significant.
Certainly, a difference of $20,020 MXN per month is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are different from each other as shown in the plot above.
# Comparing Business Analyst and Data Analyst salaries
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BA, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.538; and the p-value is 0.009. Mann-Whitney U test: The U statistic is 239.000; and the p-value is 0.003. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Business Analyst salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(BA, DA, "Business Analyst", "Data Analyst", alpha = 0.05)
The mean salary difference between the two data job categories is: $9,426. The Cohen's d between the two data job categories is: 0.66. (Large Effect Size) Business Analyst's Salary --> 95.0% Confidence Interval: ($25,875, $39,627) Data Analyst's Salary --> 95.0% Confidence Interval: ($17,085, $31,774)
From the effect size analysis, it is possible to conclude that the mean salary difference between Business Analyst and Data Analyst positions is, not only statistically significant, but also practically significant.
Certainly, a difference of $9,426 MXN per month is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are different from each other as shown in the plot above.
The data job category with the highest salary in the Mexican labor market in February 2023 according to the OCC website was Data Architect. Indeed, the average salary for Data Architect positions was higher than for both Data Engineers and Data Scientists. Thus, the present study's hypothesis is rejected.
On the other hand, the data job category most demanded in the Mexican labor market was Data Analyst; even though it was also the one with the lowest salary. Also, this data job category is the most demanded across the different Mexican states, whereas Data Architect and Data Scientist positions were the most concentrated in certain locations, namely, Ciudad de México, Nuevo León, and Jalisco.
Moreover, Ciudad de México was the location where it is possible to find the highest jobs demand and the highest salaries. However, Nuevo León, Estado de México, and Jalisco were locations where the demand of data jobs and the salaries offered are the highest after the capital.
Furthermore, the companies with the greatest demand of data positions were Manpower, Atento Servicios, Praxis and Softek; nevertheless, the organizations that offered the highest salaries were Especialistas en Talento, Zemsania México, Resources Connection México, Zegovia RH, and Reclutamiento en Tecnología, which correspond to recruiting agencies and tech consulting firms.
Additionally, it was also found that the data jobs demand from some companies spread across several Mexican states such as Accenture, Praxis or Softek, and that there are some well-known organizations whose data jobs demand is not located in the capital region, such as HP (Jalisco).
The results of the present study suggest that Data Analyst, Business Analyst, and Data Engineer positions were more demanded across different organizations. On the contrary, Data Scientist and, certainly, Data Architect vacancies could only be found in more specific organizations like tech companies or tech consultancy firms.
Finally, regarding the limitations of the present study, it is important to bear in mind that the data was collected solely from the OCC website and only for a very short period of time. Thus, very few observations were obtained for the least demanded data jobs categories: Data Architect and Business Analyst. Also, the collected data mostly corresponded to Ciudad de México, Nuevo León, Estado de México, and Jalisco, and no distinction was made among entry level, middle and senior positions. Thus, as future perspectives, it would be advisable to gather data from more job websites, retrieve information for a longer time span, make a distinction among entry level, middle and senior positions, and collect more salary data for Data Architect and Business Analyst positions as well as for other Mexican states.
# Compressing all figures in a ZIP file
!zip -r figures.zip Figures
adding: Figures/ (stored 0%) adding: Figures/Fig8_DemandPerCompanyAndLocationPX.svg (deflated 72%) adding: Figures/Fig4_DemandPerLocationAndDataJobCategoryMPL.png (deflated 18%) adding: Figures/Fig_BootstrapDistributions_Data Architect-Data Engineer_MPL.png (deflated 19%) adding: Figures/Fig3_DemandPerDataJobCategoryInTheMostImportantLocationsPX.png (deflated 27%) adding: Figures/Fig14_SalaryPerCompanyAndDataJobCategoryPX.png (deflated 15%) adding: Figures/Fig13_Top20SalaryPerCompanyMPL.png (deflated 17%) adding: Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategoryPX.svg (deflated 83%) adding: Figures/Fig15_DataJobsSalaryDistributionMPL.png (deflated 26%) adding: Figures/Fig_BootstrapDistributions_Business Analyst-Data Analyst_MPL.png (deflated 19%) adding: Figures/Fig14_SalaryPerCompanyAndDataJobCategoryMPL.png (deflated 18%) adding: Figures/Fig10_AverageSalaryPerDataJobCategoryMPL.png (deflated 20%) adding: Figures/Fig4_DemandPerLocationAndDataJobCategoryPX.svg (deflated 83%) adding: Figures/Fig13_Top20SalaryPerCompanyPX.svg (deflated 90%) adding: Figures/Fig1_DemandOfDataJobsPerCategoryPX.png (deflated 14%) adding: Figures/Fig11_SalaryPerDataJobCategoryMPL.png (deflated 20%) adding: Figures/Fig11_SalaryPerDataJobCategoryPX.png (deflated 19%) adding: Figures/Fig14_SalaryPerCompanyAndDataJobCategoryPX.svg (deflated 83%) adding: Figures/Fig2_DemandOfDataJobsPerMexicanStatePX.svg (deflated 57%) adding: Figures/Fig16_SalaryDistributionsPerDataJobCategoryPX.svg (deflated 92%) adding: Figures/Fig12_SalaryPerLocationAndDataJobCategoryPX.svg (deflated 84%) adding: Figures/Fig_BootstrapDistributions_Data Engineer-Data Analyst_MPL.png (deflated 19%) adding: Figures/Fig15_DataJobsSalaryDistributionPX.png (deflated 36%) adding: Figures/Fig_BootstrapDistributions_Data Scientist-Business Analyst_MPL.png (deflated 19%) adding: Figures/Fig7_DemandPerCompanyAndLocationTop30MPL.png (deflated 18%) adding: Figures/Fig6_DemandPerCompanyAndDataJobCategoryTop30MPL.png (deflated 18%) adding: Figures/Fig16_SalaryDistributionsPerDataJobCategoryPX.png (deflated 29%) adding: Figures/Fig_BootstrapDistributions_Data Engineer-Data Scientist_MPL.png (deflated 19%) adding: Figures/Fig11_SalaryPerDataJobCategoryPX.svg (deflated 91%) adding: Figures/Fig4_DemandPerLocationAndDataJobCategoryPX.png (deflated 15%) adding: Figures/Fig8_DemandPerCompanyAndLocationPX.png (deflated 24%) adding: Figures/Fig_BootstrapDistributions_Data Scientist-Data Analyst_MPL.png (deflated 19%) adding: Figures/Fig12_SalaryPerLocationAndDataJobCategoryMPL.png (deflated 17%) adding: Figures/Fig7_DemandPerCompanyAndLocationTop30PX.svg (deflated 76%) adding: Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategoryMPL.png (deflated 18%) adding: Figures/Fig8_DemandPerCompanyAndLocationMPL.png (deflated 22%) adding: Figures/Fig6_DemandPerCompanyAndDataJobCategoryTop30PX.svg (deflated 76%) adding: Figures/Fig7_DemandPerCompanyAndLocationTop30PX.png (deflated 16%) adding: Figures/Fig15_DataJobsSalaryDistributionPX.svg (deflated 86%) adding: Figures/Fig6_DemandPerCompanyAndDataJobCategoryTop30PX.png (deflated 12%) adding: Figures/Fig13_Top20SalaryPerCompanyPX.png (deflated 21%) adding: Figures/Fig2_DemandOfDataJobsPerMexicanStatePX.png (deflated 11%) adding: Figures/Fig10_AverageSalaryPerDataJobCategoryPX.png (deflated 27%) adding: Figures/Fig5_Top15CompaniesDemandingDataJobsPX.svg (deflated 87%) adding: Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategoryPX.png (deflated 25%) adding: Figures/Fig3_DemandPerDataJobCategoryInTheMostImportantLocationsMPL.png (deflated 17%) adding: Figures/Fig16_SalaryDistributionsPerDataJobCategoryMPL.png (deflated 24%) adding: Figures/Fig5_Top15CompaniesDemandingDataJobsPX.png (deflated 22%) adding: Figures/Fig_BootstrapDistributions_Data Architect-Data Scientist_MPL.png (deflated 20%) adding: Figures/Fig10_AverageSalaryPerDataJobCategoryPX.svg (deflated 83%) adding: Figures/Fig1_DemandOfDataJobsPerCategoryPX.svg (deflated 80%) adding: Figures/Fig12_SalaryPerLocationAndDataJobCategoryPX.png (deflated 15%) adding: Figures/Fig1_DemandOfDataJobsPerCategoryMPL.png (deflated 11%) adding: Figures/Fig_BootstrapDistributions_Data Engineer-Business Analyst_MPL.png (deflated 19%) adding: Figures/Fig3_DemandPerDataJobCategoryInTheMostImportantLocationsPX.svg (deflated 88%)
# Code for composing the Python Requirements File
def get_imported_packages():
"""
Function to get imported packages to current notebook.
Parameters
None
Returns
modules: List of imported packages
"""
p = get_installed_distributions()
p = {package.key:package.version for package in p}
imported_modules = set(sys.modules.keys())
#imported_modules.remove('pip')
modules = [(m, p[m]) for m in imported_modules if p.get(m, False)]
return modules
def generate_requirements(filepath:str, modules):
"""
Function to print a set of packages into a text file.
Parameters
filepath: String with the name of the output text file
modules: List of the packages to be printed in the output text file
Returns
None
"""
with open(filepath, 'w') as f:
for module, version in modules:
f.write(f"{module}=={version}\n")
generate_requirements('requirements.txt', get_imported_packages())
# End